Skip Headers

Oracle Text Reference
Release 9.2

Part Number A96518-01
Go To Documentation Library
Home
Go To Product List
Book List
Go To Table Of Contents
Contents
Go To Index
Index

Master Index

Feedback

Go to previous page Go to next page

2
Indexing

This chapter describes the various elements you can use to create your Oracle Text index.

The following topics are discussed in this chapter:

Overview

When you use CREATE INDEX to create an index or ALTER INDEX to manage an index, you can optionally specify indexing preferences, stoplists, and section groups in the parameter string. Specifying a preference, stoplist, or section group answers one of the following questions about the way Oracle indexes text:

Preference Class Answers the Question

Datastore

How are your documents stored?

Filter

How can the documents be converted to plain text?

Lexer

What language is being indexed?

Wordlist

How should stem and fuzzy queries be expanded?

Storage

How should the index tables be stored?

Stop List

What words or themes are not to be indexed?

Section Group

Is querying within sections enabled, and how are the document sections defined?

This chapter describes how to set each preference. You enable an option by creating a preference with one of the types described in this chapter.

For example, to specify that your documents are stored in external files, you can create a datastore preference called mydatastore using the FILE_DATASTORE type. You specify mydatastore as the datastore preference in the parameter clause of CREATE INDEX.

Creating Preferences

To create a datastore, lexer, filter, wordlist, or storage preference, you use the CTX_DDL.CREATE_PREFERENCE procedure and specify one of the types described in this chapter. For some types, you can also set attributes with the CTX_DDL.SET_ATTRIBUTE procedure.

To create a stoplists, use CTX_DDL.CREATE_STOPLIST. You can add stopwords to a stoplist with CTX_DDL.ADD_STOPWORD.

To create section groups, use CTX_DDL.CREATE_SECTION_GROUP and specify a section group type. You can add sections to section groups with CTX_DDL. ADD_ZONE_SECTION or CTX_DDL.ADD_FIELD_SECTION.

Datastore Types

Use the datastore types to specify how your text is stored. To create a datastore preference, you must use one of the following datastore types:

Datastore Type Use When

DIRECT_DATASTORE

Data is stored internally in the text column. Each row is indexed as a single document.

MULTI_COLUMN_DATASTORE

Data is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one per row.

DETAIL_DATASTORE

Data is stored internally in the text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a master table.

FILE_DATASTORE

Data is stored externally in operating system files. Filenames are stored in the text column, one per row.

NESTED_DATASTORE

Data is stored in a nested table.

URL_DATASTORE

Data is stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) are stored in the text column.

USER_DATASTORE

Documents are synthesized at index time by a user-defined stored procedure.

DIRECT_DATASTORE

Use the DIRECT_DATASTORE type for text stored directly in the text column, one document per row. DIRECT_DATASTORE has no attributes.

The following columns types are supported: CHAR, VARCHAR, VARCHAR2, BLOB, CLOB, BFILE, or XMLType.


Note:

If your column is a BFILE, you must grant read permission to CTXSYS on all directories used by the BFILEs.


DIRECT_DATASTORE CLOB Example

The following example creates a table with a CLOB column to store text data. It then populates two rows with text data and indexes the table using the system-defined preference CTXSYS.DEFAULT_DATASTORE.

create table mytable(id number primary key, docs clob); 

insert into mytable values(111555,'this text will be indexed');
insert into mytable values(111556,'this is a direct_datastore example');
commit;

create index myindex on mytable(docs) 
  indextype is ctxsys.context 
  parameters ('DATASTORE CTXSYS.DEFAULT_DATASTORE');

MULTI_COLUMN_DATASTORE

Use this datastore when your text is stored in more than one column. During indexing, the system concatenates the text columns and indexes the text as a single document.

MULTI_COLUMN_DATASTORE has the following attributes:

Attribute Attribute Value

columns

Specify a comma separated list of columns to be concatenated during indexing. You can also specify any expression allowable for the select statement column list for the base table. This includes expressions, PL/SQL functions, column aliases, and so on.

NUMBER and DATE column types are supported. They are converted to text before indexing using the default format mask. The TO_CHAR function can be used in the column list for formatting.

RAW and BLOB columns are directly concatenated as binary data.

LONG, LONG RAW, NCHAR, and NCLOB, nested table columns and collections are not supported.

The column list is limited to 500 bytes.

Indexing and DML

To index, you must create a dummy column to specify in the CREATE INDEX statement. This column's contents are not made part of the virtual document, unless its name is specified in the columns attribute.

The index is synchronized only when the dummy column is updated. You can create triggers to propagate changes if needed.

MULTI_COLUMN_DATASTORE Security

Only CTXSYS is allowed to create preferences for the MULTI_COLUMN_DATASTORE type. Any other user who attempts to create a MULTI_COLUMN_DATASTORE preference receives an error.

Oracle makes this restriction because when the columns attribute contains a function call, the call is made by the CTXSYS schema. The potential exists for a malicious CTXAPP users to execute arbitrary functions for which they do not have execute permission.

If this is too restrictive, you can create a stored procedure under CTXSYS to create MULTI_COLUMN_DATASTORE preferences. The effective user is CTXSYS, who creates and owns the preferences. However, you can call this procedure from any schema as CTXSYS.

MULTI_COLUMN_DATASTORE Example

The following example creates a multi-column datastore preference called my_multi with three text columns:

begin


ctx_ddl.create_preference('my_multi', 'MULTI_COLUMN_DATASTORE');
ctx_ddl.set_attribute('my_multi', 'columns', 'column1, column2, column3');
end;

Tagging Behavior

During indexing, the system creates a virtual document for each row. The virtual document is composed of the contents of the columns concatenated in the listing order with column name tags automatically added. For example:

create table mc(id number primary key, name varchar2(10), address varchar2(80));
insert into mc values(1, 'John Smith', '123 Main Street');

exec ctx_ddl.create_preference('mymds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attibute('mymds', 'columns', 'name, address');

This produces the following virtual text for indexing:

<NAME>
John Smith
</NAME>
<ADDRESS>
123 Main Street
</ADDRESS>

The system indexes the text between the tags, ignoring the tags themselves.

Indexing Columns as Sections

To index these tags as sections, you can optionally create field sections with the BASIC_SECTION_GROUP.


Note:

No section group is created when you use the MULTI_COLUMN_DATASTORE. To create sections for these tags, you must create a section group.


When you use expressions or functions, the tag is composed of the first 30 characters of the expression unless a column alias is used.

For example, if your expression is as follows:

exec ctx_ddl.set_attibute('mymds', 'columns', '4 + 17');

then it produces the following virtual text:

<4 + 17>
21
</4 + 17>

If your expression is as follows:

exec ctx_ddl.set_attibute('mymds', 'columns', '4 + 17 col1');

then it produces the following virtual text:

<col1>
21
<col1>

The tags are in uppercase unless the column name or column alias is in lowercase and surrounded by double quotes. For example:

exec ctx_ddl.set_attibute('mymds', 'COLUMNS', 'foo');

produces the following virtual text:

<FOO>
content of foo
</FOO>

For lowercase tags, use the following:

exec ctx_ddl.set_attibute('mymds', 'COLUMNS', 'foo "foo"');

This expression produces:

<foo>
content of foo
</foo>

DETAIL_DATASTORE

Use the DETAIL_DATASTORE type for text stored directly in the database in detail tables, with the indexed text column located in the master table.

DETAIL_DATASTORE has the following attributes:

Attribute Attribute Value

binary

Specify TRUE for Oracle to add no newline character after each detail row.

Specify FALSE for Oracle to add a newline character (\n) after each detail row automatically.

detail_table

Specify the name of the detail table (OWNER.TABLE if necessary)

detail_key

Specify the name of the detail table foreign key column(s)

detail_lineno

Specify the name of the detail table sequence column.

detail_text

Specify the name of the detail table text column.

Synchronizing Master/Detail Indexes

Changes to the detail table do not trigger re-indexing when you synchronize the index. Only changes to the indexed column in the master table triggers a re-index when you synchronize the index.

You can create triggers on the detail table to propagate changes to the indexed column in the master table row.

Example Master/Detail Tables

This example illustrates how master and detail tables are related to each other.

Master Table Example

Master tables define the documents in a master/detail relationship. You assign an identifying number to each document. The following table is an example master table, called my_master:

Column Name Column Type Description

article_id

NUMBER

Document ID, unique for each document (Primary Key)

author

VARCHAR2(30)

Author of document

title

VARCHAR2(50)

Title of document

body

CHAR(1)

Dummy column to specify in CREATE INDEX


Note:

Your master table must include a primary key column when you use the DETAIL_DATASTORE type.


Detail Table Example

Detail tables contain the text for a document, whose content is usually stored across a number of rows. The following detail table my_detail is related to the master table my_master with the article_id column. This column identifies the master document to which each detail row (sub-document) belongs.

Column Name Column Type Description

article_id

NUMBER

Document ID that relates to master table

seq

NUMBER

Sequence of document in the master document defined by article_id

text

VARCHAR2

Document text

Detail Table Example Attributes

In this example, the DETAIL_DATASTORE attributes have the following values:

Attribute Attribute Value

binary

TRUE

detail_table

my_detail

detail_key

article_id

detail_lineno

seq

detail_text

text

You use CTX_DDL.CREATE_PREFERENCE to create a preference with DETAIL_DATASTORE. You use CTX_DDL.SET_ATTRIBUTE to set the attributes for this preference as described earlier. The following example shows how this is done:

begin


ctx_ddl.create_preference('my_detail_pref', 'DETAIL_DATASTORE');
ctx_ddl.set_attribute('my_detail_pref', 'binary', 'true');
ctx_ddl.set_attribute('my_detail_pref', 'detail_table', 'my_detail');
ctx_ddl.set_attribute('my_detail_pref', 'detail_key', 'article_id');
ctx_ddl.set_attribute('my_detail_pref', 'detail_lineno', 'seq');
ctx_ddl.set_attribute('my_detail_pref', 'detail_text', 'text');
end;
Master/Detail Index Example

To index the document defined in this master/detail relationship, you specify a column in the master table with CREATE INDEX. The column you specify must be one of the allowable types.

This example uses the body column, whose function is to allow the creation of the master/detail index and to improve readability of the code. The my_detail_pref preference is set to DETAIL_DATASTORE with the required attributes:

CREATE INDEX myindex on my_master(body) indextype is ctxsys.context 
parameters('datastore my_detail_pref');

In this example, you can also specify the title or author column to create the index. However, if you do so, changes to these columns will trigger a re-index operation.

FILE_DATASTORE

The FILE_DATASTORE type is used for text stored in files accessed through the local file system.

FILE_DATASTORE has the following attribute(s):

Attribute Attribute Values

path

path1:path2: :pathn

path

Specify the full directory path name of the files stored externally in a file system. When you specify the full directory path as such, you need only include file names in your text column.

You can specify multiple paths for path, with each path separated by a colon (:). File names are stored in the text column in the text table.

If you do not specify a path for external files with this attribute, Oracle requires that the path be included in the file names stored in the text column.

FILE_DATASTORE Example

This example creates a file datastore preference called COMMON_DIR that has a path of /mydocs:

begin
 ctx_ddl.create_preference('COMMON_DIR','FILE_DATASTORE');
 ctx_ddl.set_attribute('COMMON_DIR','PATH','/mydocs');
end;

When you populate the table mytable, you need only insert filenames. The path attribute tells the system where to look during the indexing operation.

create table mytable(id number primary key, docs varchar2(2000)); 
insert into mytable values(111555,'first.txt');
insert into mytable values(111556,'second.txt');
commit;

Create the index as follows:

create index myindex on mytable(docs)
  indextype is ctxsys.context
  parameters ('datastore COMMON_DIR'); 

URL_DATASTORE

Use the URL_DATASTORE type for text stored:

You store each URL in a single text field.

URL Syntax

The syntax of a URL you store in a text field is as follows (with brackets indicating optional parameters):

[URL:]<access_scheme>://<host_name>[:<port_number>]/[<url_path>]

The access_scheme string you specify can be either ftp, http, or file. For example:

http://mymachine.us.oracle.com/home.html

As this syntax is partially compliant with the RFC 1738 specification, the following restriction holds for the URL syntax:

URL_DATASTORE Attributes

URL_DATASTORE has the following attributes:

Attribute Attribute Values

timeout

Specify the timeout in seconds. The valid range is 15 to 3600 seconds. The default is 30.

maxthreads

Specify the maximum number of threads that can be running simultaneously. Use a number between 1and 1024. The default is 8.

urlsize

Specify the maximum length of URL string in bytes. Use a number between 32 and 65535. The default is 256.

maxurls

Specify maximum size of URL buffer. Use a number between 32 and 65535. The defaults is 256.

maxdocsize

Specify the maximum document size. Use a number between 256 and 2,147,483,647 bytes (2 gigabytes). The defaults is 2,000,000.

http_proxy

Specify the host name of http proxy server. Optionally specify port number with a colon in the form hostname:port.

ftp_proxy

Specify the host name of ftp proxy server. Optionally specify port number with a colon in the form hostname:port.

no_proxy

Specify the domain for no proxy server. Use a comma separated string of up to 16 domain names.

timeout

Specify the length of time, in seconds, that a network operation such as a connect or read waits before timing out and returning a timeout error to the application. The valid range for timeout is 15 to 3600 and the default is 30.


Note:

Since timeout is at the network operation level, the total timeout may be longer than the time specified for timeout.


maxthreads

Specify the maximum number of threads that can be running at the same time. The valid range for maxthreads is 1 to 1024 and the default is 8.

urlsize

Specify the maximum length, in bytes, that the URL data store supports for URLs stored in the database. If a URL is over the maximum length, an error is returned. The valid range for urlsize is 32 to 65535 and the default is 256.


Note:

The product values specified for maxurls and urlsize cannot exceed 5,000,000.

In other words, the maximum size of the memory buffer (maxurls * urlsize) for the URL is approximately 5 megabytes.


maxurls

Specify the maximum number of rows that the internal buffer can hold for HTML documents (rows) retrieved from the text table. The valid range for maxurls is 32 to 65535 and the default is 256.


Note:

The product values specified for maxurls and urlsize cannot exceed 5,000,000.

In other words, the maximum size of the memory buffer (maxurls * urlsize) for the URL is approximately 5 megabytes.


Specify the maximum size, in bytes, that the URL datastore supports for accessing HTML documents whose URLs are stored in the database. The valid range for maxdocsize is 1 to 2,147,483,647 (2 gigabytes), and the default is 2,000,000.

http_proxy

Specify the fully qualified name of the host machine that serves as the HTTP proxy (gateway) for the machine on which Oracle Text is installed. You can optionally specify port number with a colon in the form hostname:port.

You must set this attribute if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

ftp_proxy

Specify the fully-qualified name of the host machine that serves as the FTP proxy (gateway) for the machine on which Oracle Text is installed. You can optionally specify a port number with a colon in the form hostname:port.

This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

no_proxy

Specify a string of domains (up to sixteen, separate by commas) which are found in most, if not all, of the machines in your intranet. When one of the domains is encountered in a host name, no request is sent to the machine(s) specified for ftp_proxy and http_proxy. Instead, the request is processed directly by the host machine identified in the URL.

For example, if the string us.oracle.com, uk.oracle.com is entered for no_proxy, any URL requests to machines that contain either of these domains in their host names are not processed by your proxy server(s).

URL_DATASTORE Example

This example creates a URL_DATASTORE preference called URL_PREF for which the http_proxy, no_proxy, and timeout attributes are set. The defaults are used for the attributes that are not set.

begin
 ctx_ddl.create_preference('URL_PREF','URL_DATASTORE');
 ctx_ddl.set_attribute('URL_PREF','HTTP_PROXY','www-proxy.us.oracle.com');
 ctx_ddl.set_attribute('URL_PREF','NO_PROXY','us.oracle.com');
 ctx_ddl.set_attribute('URL_PREF','Timeout','300');
end;

Create the table and insert values into it:

create table urls(id number primary key, docs varchar2(2000));
insert into urls values(111555,'http://context.us.oracle.com');
insert into urls values(111556,'http://www.sun.com');
commit;
 

To create the index, specify URL_PREF as the datastore:

create index datastores_text on urls ( docs ) 
  indextype is ctxsys.context 
  parameters ( 'Datastore URL_PREF' ); 


USER_DATASTORE

Use the USER_DATASTORE type to define stored procedures that synthesize documents during indexing. For example, a user procedure might synthesize author, date, and text columns into one document to have the author and date information be part of the indexed text.

The USER_DATASTORE has the following attributes:

Attribute Attribute Value

procedure

Specify the procedure that synthesizes the document to be indexed.

This procedure must be owned by CTXSYS and must be executable by the index owner.

output_type

Specify the data type of the second argument to procedure. Valid values are CLOB, BLOB, CLOB_LOC, BLOB_LOC, or VARCHAR2. The default is CLOB.

When you specify CLOB_LOC, BLOB_LOC, you indicate that no temporary CLOB or BLOB is needed, since your procedure copies a locator to the IN/OUT second parameter.

procedure

Specify the name of the procedure that synthesizes the document to be indexed. This specification must be in the form PROCEDURENAME or PACKAGENAME.PROCEDURENAME. The schema owner name is constrained to CTXSYS, so specifying owner name is not necessary.

The procedure you specify must have two arguments defined as follows:

procedure (r IN ROWID, c IN OUT NOCOPY <output_type>)

The first argument r must be of type ROWID. The second argument c must be of type output_type. NOCOPY is a compiler hint that instructs Oracle to pass parameter c by reference if possible.


Note::

The procedure name and its arguments can be named anything. The arguments r and c are used in this example for simplicity.


The stored procedure is called once for each row indexed. Given the rowid of the current row, procedure must write the text of the document into its second argument, whose type you specify with output_type.

Constraints

The following constraints apply to procedure:

Editing Procedure after Indexing

If you change or edit the stored procedure, indexes based upon it will not be notified, so you must manually re-create such indexes. So if the stored procedure makes use of other columns, and those column values change, the row will not be re-indexed. The row is re-indexed only when the indexed column changes.

output_type

Specify the datatype of the second argument to procedure. You can use either CLOB, BLOB, CLOB_LOC, BLOB_LOC, or VARCHAR2.

USER_DATASTORE with CLOB Example

Consider a table in which the author, title, and text fields are separate, as in the articles table defined as follows:

create table articles( 
    id       number, 
    author   varchar2(80), 
    title    varchar2(120), 
    text     clob );

The author and title fields are to be part of the indexed document text. Assume user appowner writes a stored procedure with the user datastore interface that synthesizes a document from the text, author, and title fields:

create procedure myproc(rid in rowid, tlob in out clob) is 
  begin 
      for c1 in (select author, title, text from articles 
                  where rowid = rid) 
      loop 


   dbms_lob.writeappend(tlob, length(c1.title), c1.title);
   dbms_lob.writeappend(tlob, length(c1.author), c1.author);
   dbms_lob.writeappend(tlob, length(c1.text), c1.text);
end loop; end;

This procedure takes in a rowid and a temporary CLOB locator, and concatenates all the article's columns into the temporary CLOB. The for loop executes only once.

Only procedures owned by CTXSYS are allowed for the user datastore. Therefore, CTXSYS must wrap the user procedure (owned by appowner) with a CTXSYS owned procedure as follows:

create procedure s_myproc(rid in rowid, tlob in out clob) is 
 begin 
      appowner.myproc(rid, tlob); 
 end; 
 

The CTXSYS user must make sure that the index owner can execute the stub procedure by granting execute privileges as follows:

grant execute on s_myproc to appowner ;
 

The user appowner creates the preference, setting the procedure attribute to the name of the ctxsys stub procedure as follows:

begin


ctx_ddl.create_preference('myud', 'user_datastore'); 
ctx_ddl.set_attribute('myud', 'procedure', 's_myproc'); 
ctx_ddl.set_attribute('myud', 'output_type', 'CLOB'); 
end;

When appowner creates the index on articles(text) using this preference, the indexing operation sees author and title in the document text.

USER_DATASTORE with BLOB_LOC Example

The following procedure might be used with OUTPUT_TYPE BLOB_LOC:

procedure myds(rid in rowid, dataout in out nocopy blob)
is
  l_dtype varchar2(10);
  l_pk    number;
begin
  select dtype, pk into l_dtype, l_pk from mytable where rowid = rid;
  if (l_dtype = 'MOVIE') then
    select movie_data into dataout from movietab where fk = l_pk;
  elsif (l_dtype = 'SOUND') then
    select sound_data into dataout from soundtab where fk = l_pk;
  end if;
end;

Because only procedures owned by CTXSYS are allowed for the user datastore, CTXSYS must wrap the user procedure (owned by appowner) with a CTXSYS owned procedure as follows:

create procedure s_myproc(rid in rowid, tlob in out blob) is 
 begin 
      appowner.myds(rid, tlob); 
 end; 
 

The CTXSYS user must make sure that the index owner can execute the stub procedure by granting execute privileges as follows:

grant execute on s_myproc to appowner ;
 

The user appowner creates the preference, setting the procedure and output_type attributes to correspond to the ctxsys stub procedure as follows:

begin


ctx_ddl.create_preference('myud', 'user_datastore'); 
ctx_ddl.set_attribute('myud', 'procedure', 's_myproc'); 
ctx_ddl.set_attribute('myud', 'output_type', 'blob_loc'); 
end;

NESTED_DATASTORE

Use the nested datastore type to index documents stored as rows in a nested table.

Attribute Attribute Value

nested_column

Specify the name of the nested table column.This attribute is required. Specify only the column name. Do not specify schema owner or containing table name.

nested_type

Specify the type of nested table. This attribute is required. You must provide owner name and type.

nested_lineno

Specify the name of the attribute in the nested table that orders the lines. This is like DETAIL_LINENO in detail datastore. This attribute is required.

nested_text

Specify the name of the column in the nested table type that contains the text of the line. This is like DETAIL_TEXT in detail datastore. This attribute is required. LONG column types are not supported as nested table text columns.

binary

Specify FALSE for Oracle to automatically insert a new line between lines when synthesizing the document text. If you specify TRUE, Oracle does not do this. This attribute is not required. The default is FALSE.

When using the nested table datastore, you must index a dummy column, because the extensible indexing framework disallows indexing the nested table column. See the example.

DML on the nested table is not automatically propagated to the dummy column used for indexing. For DML on the nested table to be propagated to the dummy column, your application code or trigger must explicitly update the dummy column.

Filter defaults for the index are based on the type of the nested_text column.

During validation, Oracle checks that the type exists and that the attributes you specify for nested_lineno and nested_text exist in the nested table type. Oracle does not check that the named nested table column exists in the indexed table.

NESTED_DATASTORE Example

Create the Nested Table

The following code creates a nested table and a storage table mytab for the nested table:

create type nt_rec as object (
  lno number, -- line number
  ltxt varchar2(80) -- text of line
);

create type nt_tab as table of nt_rec;
create table mytab (
   id number primary key, -- primary key
   dummy char(1), -- dummy column for indexing
   doc nt_tab -- nested table
)
nested table doc store as myntab;
Insert Values into Nested Table

The following code inserts values into the nested table for the parent row with id equal to 1.

insert into mytab values (1, null, nt_tab());
insert into table(select doc from mytab where id=1) values (1, 'the dog');
insert into table(select doc from mytab where id=1) values (2, 'sat on mat ');
commit;
Create Nested Table Preferences

The following code sets the preferences and attributes for the NESTED_DATASTORE according to the definitions of the nested table type nt_tab and the parent table mytab:

begin
-- create nested datastore pref
ctx_ddl.create_preference('ntds','nested_datastore'); 

-- nest tab column in main table
ctx_ddl.set_attribute('ntds','nested_column', 'doc'); 

-- nested table type
ctx_ddl.set_attribute('ntds','nested_type', 'scott.nt_tab');

-- lineno column in nested table
ctx_ddl.set_attribute('ntds','nested_lineno','lno');

--text column in nested table
ctx_ddl.set_attribute('ntds','nested_text', 'ltxt');
end;
Create Index on Nested Table

The following code creates the index using the nested table datastore:

create index myidx on mytab(dummy) -- index dummy column, not nest table
indextype is ctxsys.context parameters ('datastore ntds');
Query Nested Datastore

The following select statement queries the index built from a nested table:

select * from mytab where contains(dummy, 'dog and mat')>0;
-- returns document 1, since it has dog in line 1 and mat in line 2.

Filter Types

Use the filter types to create preferences that determine how text is filtered for indexing. Filters allow word processor and formatted documents as well as plain text, HTML, and XML documents to be indexed.

For formatted documents, Oracle stores documents in their native format and uses filters to build temporary plain text or HTML versions of the documents. Oracle indexes the words derived from the plain text or HTML version of the formatted document.

To create a filter preference, you must use one of the following types:

Filter Preference type Description

CHARSET_FILTER

Character set converting filter

INSO_FILTER

Inso filter for filtering formatted documents

NULL_FILTER

No filtering required. Use for indexing plain text, HTML, or XML documents

USER_FILTER

User-defined external filter to be used for custom filtering

PROCEDURE_FILTER

User-defined stored procedure filter to be used for custom filtering.

CHARSET_FILTER

Use the CHARSET_FILTER to convert documents from a non-database character set to the database character set.

CHARSET_FILTER has the following attribute:

Attribute Attribute Value

charset

Specify the Globalization Support name of source character set.

If you specify UTF16AUTO, this filter automatically detects the if the character set is UTF16 big- or little-endian.

Specify JAAUTO for Japanese character set auto-detection. This filter automatically detects the custom character specification in JA16EUC or JA16SJIS and converts to the database character set. This filter is useful in Japanese when your data files have mixed character sets.

See Also:

Oracle9i Globalization and National Language Support Guide for more information about the supported Globalization Support character sets.

UTF-16 Big- and Little-Endian Detection

If your character set is UTF-16, you can specify UTF16AUTO to automatically detect big- or little-endian data. Oracle does so by examining the first two bytes of the document row.

If the first two bytes are 0xFE, 0xFF, the document is recognized as little-endian and the remainder of the document minus those two bytes is passed on for indexing.

If the first two bytes are 0xFF, 0xFE, the document is recognized as big-endian and the remainder of the document minus those two bytes is passed on for indexing.

If the first two bytes are anything else, the document is assumed to be big-endian and the whole document including the first two bytes is passed on for indexing.

Indexing Mixed-Character Set Columns

A mixed character set column is one that stores documents of different character sets. For example, a text table might store some documents in WE8ISO8859P1 and others in UTF8.

To index a table of documents in different character sets, you must create your base table with a character set column. In this column, you specify the document character set on a per-row basis. To index the documents, Oracle converts the documents into the database character set.

Character set conversion works with the CHARSET_FILTER. When the charset column is NULL or not recognized, Oracle assumes the source character set is the one specified in the charset attribute.


Note:

Character set conversion also works with the INSO_FILTER when the document format column is set to TEXT.


Indexing Mixed-Character Set Example

For example, create the table with a charset column:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     cset varchar2(20),
     text varchar2(80)
);

Insert plain-text documents and name the character set:

insert into hdocs values(1, 'text', 'WE8ISO8859P1', '/docs/iso.txt');
insert in hdocs values (2, 'text', 'UTF8', '/docs/utf8.txt');
commit;

Create the index and name the charset column:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.charset_filter 
  format column fmt
  charset column cset');

INSO_FILTER

The INSO_FILTER is a universal filter that filters most document formats. This filtering technology is licensed from Stellent Chicago, Inc.

Use it for indexing single and mixed-format columns.

See Also:

For a list of the formats supported by INSO_FILTER and to learn more about how to set up your environment to use this filter, see Appendix B, "Supported Document Formats".

The INSO_FILTER has the following attribute:

Attribute Attribute Values

timeout

Specify the INSO_FILTER timeout in seconds. Use a number between 0 and 42,949,672. Default is 120. Setting this value 0 disables the feature.

How this wait period is used depends on how you set timeout_type.

This feature is disabled for rows for which the corresponding charset and format column cause the INSO_FILTER to bypass the row, such as when format is marked TEXT.

Use this feature to prevent the Oracle indexing operation from waiting indefinitely on a hanging Inso filter operation.

timeout_type

Specify either HEURISTIC or FIXED. Default is HEURISTIC.

Specify HEURISTIC for Oracle to check every TIMEOUT seconds if output from Output in HTML Export has increased. The operation terminates for the document if output has not increased. An error is recorded in the CTX_USER_INDEX_ERRORS view and Oracle moves to the next document row to be indexed.

Specify FIXED to terminate the Outside In HTML Export processing after TIMEOUT seconds regardless of whether filtering was progressing normally or just hanging. This value is useful when indexing throughput is more important than taking the time to successfully filter large documents.



Indexing Formatted Documents

To index a text column containing formatted documents such as Microsoft Word, use the INSO_FILTER. This filter automatically detects the document format. You can use the CTXSYS.INSO_FILTER system-defined preference in the parameter clause as follows:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.inso_filter');

Bypassing Plain Text or HTML in Mixed Format Columns

A mixed-format column is a text column containing more than one document format, such as a column that contains Microsoft Word, PDF, plain text, and HTML documents.

The INSO_FILTER can index mixed-format columns. However, you might want to have the INSO filter bypass the plain text or HTML documents. Filtering plain text or HTML with the INSO_FILTER is redundant.

The format column in the base table allows you to specify the type of document contained in the text column. The only two types you can specify are TEXT and BINARY. During indexing, the INSO_FILTER ignores any document typed TEXT (assuming the charset column is not specified.)

To set up the INSO_FILTER bypass mechanism, you must create a format column in your base table.

For example:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     text varchar2(80)
);

Assuming you are indexing mostly Word documents, you specify BINARY in the format column to filter the Word documents. Alternatively, to have the INSO_FILTER ignore an HTML document, specify TEXT in the format column.

For example, the following statements add two documents to the text table, assigning one format as BINARY and the other TEXT:

insert into hdocs values(1, 'binary', '/docs/myword.doc');
insert in hdocs values (2, 'text', '/docs/index.html');
commit;

To create the index, use CREATE INDEX and specify the format column name in the parameter string:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.inso_filter 
  format column fmt');

If you do not specify TEXT or BINARY for the format column, BINARY is used.


Note:

You need not specify the format column in CREATE INDEX when using the INSO_FILTER.


Character Set Conversion With Inso

The INSO_FILTER converts documents to the database character set when the document format column is set to TEXT. In this case, the INSO_FILTER looks at the charset column to determine the document character set.

If the charset column value is not an Oracle character set name, the document is passed through without any character set conversion.


Note:

You need not specify the charset column when using the INSO_FILTER.


If you do specify the charset column and do not specify the format column, the INSO_FILTER works like the CHARSET_FILTER, except that in this case there is no Japanese character set auto-detection.

See Also:

"CHARSET_FILTER".

Plain Text Indexing and the INSO_FILTER

Oracle does not recommend using INSO_FILTER to index plain text documents.

If your table contains text documents exclusively, use the NULL_FILTER or the USER_FILTER.

If your table contains text documents mixed with formatted documents, Oracle recommends creating a format column and marking the text documents as TEXT to bypass INSO_FILTER. In such cases, Oracle also recommends creating a charset column to indicate the document character set.

However, if you use INSO_FILTER to index nonbinary (text) documents and you specify no format column and no charset column, the INSO_FILTER processes the document. Your indexing process is thus subject to the character set limitations of Inso technology. Specifically, your application must ensure that one of the following conditions is true:

NULL_FILTER

Use the NULL_FILTER type when plain text or HTML is to be indexed and no filtering needs to be performed. NULL_FILTER has no attributes.

Indexing HTML Documents

If your document set is entirely HTML, Oracle recommends that you use the NULL_FILTER in your filter preference.

For example, to index an HTML document set, you can specify the system-defined preferences for NULL_FILTER and HTML_SECTION_GROUP as follows:

create index myindex on docs(htmlfile) indextype is ctxsys.context 
  parameters('filter ctxsys.null_filter
  section group ctxsys.html_section_group');
See Also:

For more information on section groups and indexing HTML documents, see "Section Group Types" in this chapter.


USER_FILTER

Use the USER_FILTER type to specify an external filter for filtering documents in a column. USER_FILTER has the following attribute:

Attribute Attribute Values

command

Specify the name of the filter executable.

command

Specify the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter specified for command must recognize and handle all such formats.

The executable you specify must exist in the $ORACLE_HOME/ctx/bin directory. You must create your user-filter executable with two parameters: the first is the name of the input file to be read, and the second is the name of the output file to be written to.

If all the document formats are supported by INSO_FILTER, use INSO_FILTER instead of USER_FILTER unless additional tasks besides filtering are required for the documents.

User Filter Example

The following example perl script to be used as the user filter. This script converts the input text file specified in the first argument to uppercase and writes the output to the location specified in the second argument:

#!/usr/local/bin/perl

open(IN, $ARGV[0]);
open(OUT, ">".$ARGV[1]);

while (<IN>)
{
  tr/a-z/A-Z/;
  print OUT;
}

close (IN);
close (OUT);

Assuming that this file is named upcase.pl, create the filter preference as follows:

begin 
  ctx_ddl.create_preference 
    ( 
      preference_name => 'USER_FILTER_PREF', 
      object_name     => 'USER_FILTER' 
    ); 
  ctx_ddl.set_attribute
    ('USER_FILTER_PREF','COMMAND','upcase.pl');
end; 

Create the index in SQL*Plus as follows:

create index user_filter_idx on user_filter ( docs ) 
  indextype is ctxsys.context 
  parameters ('FILTER USER_FILTER_PREF'); 

PROCEDURE_FILTER

Use the PROCEDURE_FILTER type to filter your documents with a stored procedure. The stored procedure is called each time a document needs to be filtered.

This type has the following attributes:

Attribute Purpose Allowable Values

procedure

Name of the filter stored procedure.

Any CTXSYS owned procedure. The procedure can be a PL/SQL stored procedure.

input_type

Type of input argument for stored procedure.

VARCHAR2, BLOB, CLOB, FILE

output_type

Type of output argument for stored procedure.

VARCHAR2, CLOB, FILE

rowid_parameter

Include rowid parameter?

TRUE/FALSE

format_parameter

Include format parameter?

TRUE/FALSE

charset_parameter

Include charset parameter?

TRUE/FALSE

procedure

Specify the name of the stored procedure to use for filtering. The procedure can be a PL/SQL stored procedure. The procedure can be a safe callout or call a safe callout.

The procedure must be owned by CTXSYS and have one of the following signatures:

PROCEDURE(IN BLOB, IN OUT NOCOPY CLOB)
PROCEDURE(IN CLOB, IN OUT NOCOPY CLOB)
PROCEDURE(IN VARCHAR, IN OUT NOCOPY CLOB)
PROCEDURE(IN BLOB, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN CLOB, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN VARCHAR2, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN BLOB, IN VARCHAR2)
PROCEDURE(IN CLOB, IN VARCHAR2)
PROCEDURE(IN VARCHAR2, IN VARCHAR2)

The first argument is the content of the unfiltered row as passed out by the datastore. The second argument is for the procedure to pass back the filtered document text.

The procedure attribute is mandatory and has no default.

input_type

Specify the type of the input argument of the filter procedure. You can specify one of the following:

Type Description

BLOB

The input argument is of type BLOB. The unfiltered document is contained in the BLOB passed in.

CLOB

The input argument is of type CLOB. The unfiltered document is contained in the CLOB passed in.

No pre-filtering or character set conversion is done. If the datastore outputs binary data, that binary data is written directly to the CLOB, with Globalization Support doing implicit mapping to character data as best it can.

VARCHAR2

The input argument is of type VARCHAR2. The unfiltered document is contained in the VARCHAR2 passed in.

The document can be a maximum of 32767 bytes of data. If the unfiltered document is greater than this length, an error is raised for the document and the filter procedure is not called.

FILE

The input argument is of type VARCHAR2. The unfiltered document content is contained in a temporary file in the file system whose filename is stored in the VARCHAR2 passed in.

For example, the value of the passed-in VARCHAR2 might be 'tmp/mydoc.tmp' which means that the document content is stored in the file '/tmp/mydoc.tmp'.

The file input type is useful only when your procedure is a safe callout, which can read the file.

The input_type attribute is not mandatory. If not specified, BLOB is the default.

output_type

Specify the type of output argument of the filter procedure. You can specify one of the following types:

Type Description

CLOB

The output argument is IN OUT NOCOPY CLOB. Your procedure must write the filtered content to the CLOB passed in.

VARCHAR2

The output argument is IN OUT NOCOPY VARCHAR2. Your procedure must write the filtered content to the VARCHAR2 variable passed in.

FILE

The output argument must be IN VARCHAR2. On entering the filter procedure, the output argument is the name of a temporary file. The filter procedure must write the filtered contents to this named file.

Using a FILE output type is useful only when the procedure is a safe callout, which can write to the file.

The output_type attribute is not mandatory. If not specified, CLOB is the default.

rowid_ parameter

When you specify TRUE, the rowid of the document to be filtered is passed as the first parameter, before the input and output parameters.

For example, with INPUT_TYPE BLOB, OUTPUT_TYPE CLOB, and ROWID_PARAMETER TRUE, the filter procedure must have the signature as follows:

procedure(in rowid, in blob, in out nocopy clob)

This attribute is useful for when your procedure requires data from other columns or tables. This attribute is not mandatory. The default is FALSE.

format_parameter

When you specify TRUE, the value of the format column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid parameter, if enabled.

You specify the name of the format column at index time in the parameters string, using the keyword 'format column <columnname>'. The parameter type must be IN VARCHAR2.

The format column value can be read via the rowid parameter, but this attribute allows a single filter to work on multiple table structures, because the format attribute is abstracted and does not require the knowledge of the name of the table or format column.

FORMAT_PARAMETER is not mandatory. The default is FALSE.

charset_parameter

When you specify TRUE, the value of the charset column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid and format parameter, if enabled.

You specify the name of the charset column at index time in the parameters string, using the keyword 'charset column <columnname>'. The parameter type must be IN VARCHAR2.

CHARSET_PARAMETER attribute is not mandatory. The default is FALSE.

Parameter Order

ROWID_PARAMETER, FORMAT_PARAMETER, and CHARSET_PARAMETER are all independent. The order is rowid, the format, then charset, but the filter procedure is passed only the minimum parameters required.

For example, assume that INPUT_TYPE is BLOB and OUTPUT_TYPE is CLOB. If your filter procedure requires all parameters, the procedure signature must be:

(id IN ROWID, format IN VARCHAR2, charset IN VARCHAR2, input IN BLOB, output IN 
OUT NOCOPY CLOB)

If your procedure requires only the ROWID, then the procedure signature must be:

(id IN ROWID,input IN BLOB, ouput IN OUT NOCOPY CLOB)

Create Index Requirements

In order to create an index using a PROCEDURE_FILTER preference, the index owner must have execute permission on the procedure.Oracle checks this at index time, which is similar to the security measures for USER_DATASTORE.

Error Handling

The filter procedure can raise any errors needed through the normal PL/SQL raise_application_error facility. These errors are propagated to the CTX_USER_INDEX_ERRORS view or reported to the user, depending on how the filter is invoked.

Procedure Filter Preference Example

Consider a filter procedure CTXSYS.NORMALIZE that you define with the following signature:

PROCEDURE NORMALIZE(id IN ROWID, charset IN VARCHAR2, input IN CLOB, 
output IN OUT NOCOPY VARCHAR2);

To use this procedure as your filter, set up your filter preference as follows:

begin
ctx_ddl.create_preference('myfilt', 'procedure_filter');
ctx_ddl.set_attribute('myfilt', 'procedure', 'normalize');
ctx_ddl.set_attribute('myfilt', 'input_type', 'clob');
ctx_ddl.set_attribute('myfilt', 'output_type', 'varchar2');
ctx_ddl.set_attribute('myfilt', 'rowid_parameter', 'TRUE');
ctx_ddl.set_attribute('myfilt', 'charset_parameter', 'TRUE');
end;

Lexer Types

Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the following lexer types:

type Description

BASIC_LEXER

Lexer for extracting tokens from text in languages, such as English and most western European languages that use white space delimited words.

MULTI_LEXER

Lexer for indexing tables containing documents of different languages

CHINESE_VGRAM_LEXER

Lexer for extracting tokens from Chinese text.

CHINESE_LEXER

Lexer for extracting tokens from Chinese text.

JAPANESE_VGRAM_LEXER

Lexer for extracting tokens from Japanese text.

JAPANESE_LEXER

Lexer for extracting tokens from Japanese text.

KOREAN_LEXER

Lexer for extracting tokens from Korean text.

KOREAN_MORPH_LEXER

Lexer for extracting tokens from Korean text (recommended).

USER_LEXER

Lexer you create to index a particular language.

BASIC_LEXER

Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all other supported whitespace delimited languages.

The BASIC_LEXER also enables base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for whitespace delimited languages that have extended character sets.

In English and French, you can use the BASIC_LEXER to enable theme indexing.


Note:

Any processing the lexer does to tokens before indexing (for example, removal of characters, and base-letter conversion) are also performed on query terms at query time. This ensures that the query terms match the form of the tokens in the Text index.


BASIC_LEXER supports any database character set.

BASIC_LEXER has the following attributes:

Attribute Attribute Values

continuation

characters

numgroup

characters

numjoin

characters

printjoins

characters

punctuations

characters

skipjoins

characters

startjoins

non alphanumeric characters that occur at the beginning of a token (string)

endjoins

non alphanumeric characters that occur at the end of a token (string)

whitespace

characters (string)

newline

NEWLINE (\n)

CARRIAGE_RETURN (\r)

base_letter

NO (disabled)

YES (enabled)

mixed_case

NO (disabled)

YES (enabled)

composite

DEFAULT (no composite word indexing, default)

GERMAN (German composite word indexing)

DUTCH (Dutch composite word indexing)

index_stems

0 NONE

1 ENGLISH

2 DERIVATIONAL

3 DUTCH

4 FRENCH

5 GERMAN

6 ITALIAN

7 SPANISH

index_themes

YES (enabled)

NO (disabled, default)

index_text