Simpana™ 7.0 - Content Indexing
Thursday, 01 January 2009

Frequently Asked Questions

Q. What’s a Content Indexing Cluster or Cloud?
A. The terms Cluster, Cloud, and Engine all mean the same thing: a unified collection of
Content Indexing nodes operating under the same Administration/Identity. Of the three,
the preferred term is Engine. A Content Indexing node is a Windows 2003 (32-bit) Server
with CommVault’s Content Indexing software installed. Nodes can be configured to
perform different or multiple roles in the Content Indexing process.

Q. Can a Content Indexing Engine consist of a single Content
Indexing Node?

A. Yes, all content indexing roles can be performed by a single CI node. This single-node
configuration is of limited use, suited to small, non-dynamic environments, and is not
recommended for most production environments. A single-node Content Indexing
Engine with no other function can support indexing about 15 million objects. In
environments with low search demand, you can co-locate a Web Search Server on the same
node/engine. In this configuration, the indexing support limit drops to about 10 million
objects.

Q. What counts as an indexed object?

A. All index-able files, such as Microsoft Office and SharePoint documents, PDF files,
Exchange and Domino Lotus Notes messages, and their attachments, are considered
objects. An index-able object is one that contains readable text. In the case of a mail
message with attachments, the message and each attachment are distinct objects.
Binary (e.g. .exe, .dll), image (e.g. .jpg, .gif, .bmp), and database files (e.g. .dbf, .mdf, .ldf)
are not index-able. You can set filters that control which file types the Content Indexing
Engine does or does not index. There is a maximum ingestible object size limit
of 50MB. Objects whose current size (compressed or uncompressed) is greater than
50MB are not indexed. This limit can be set lower.
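The rules above can be sketched as a small pre-check. This is an illustration only, not CommVault code: the function names are hypothetical, the extension set contains just the examples named in the answer, 1MB is assumed to be 1024 * 1024 bytes, and the sketch assumes that non-index-able attachments are not counted as objects.

```python
from pathlib import PurePath

# Example extensions the answer lists as not index-able (not an exhaustive list).
NON_INDEXABLE_EXTENSIONS = {".exe", ".dll", ".jpg", ".gif", ".bmp", ".dbf", ".mdf", ".ldf"}

# Default maximum object size; assumes 1MB = 1024 * 1024 bytes. The limit can be set lower.
MAX_OBJECT_BYTES = 50 * 1024 * 1024


def is_indexable(filename: str, size_bytes: int, max_bytes: int = MAX_OBJECT_BYTES) -> bool:
    """Return True if the file passes the size limit and the extension filter."""
    if size_bytes > max_bytes:
        return False
    return PurePath(filename).suffix.lower() not in NON_INDEXABLE_EXTENSIONS


def count_message_objects(attachments: list[tuple[str, int]]) -> int:
    """Count objects for one mail message: the message itself plus each
    attachment that passes is_indexable (an assumption of this sketch)."""
    return 1 + sum(1 for name, size in attachments if is_indexable(name, size))
```

For example, a message with one PDF and one GIF attachment counts as two objects under these assumptions: the message and the PDF.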

Q. How do I scale out indexing volume in a Content Indexing
Engine?

A. Up to nine (9) content indexing nodes can be configured within a single Content
Indexing Engine. Only one of these nodes can have the Admin role. The Admin node
acts as the entry/exit point for all data and provides the management interface. To scale
out index volume, all nodes can be configured to perform Search and Index roles. Each
node is allocated a portion of the document database (fixml). The Admin node functions
as the distribution/load balancing manager. Each Search and Index node can handle
about 15 million objects. Hence a nine-node Content Indexing Engine can index about
120 million objects. While you can add nodes for volume scaling at any time, it is best for
load balancing to have all expected nodes in place when indexing starts.
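The sizing figures above can be turned into a rough capacity check. The sketch below is illustrative only. Its function names are hypothetical, and it assumes that in a multi-node engine one node's worth of capacity (the Admin node's) is not counted. That is one reading that reconciles nine nodes with the stated figure of about 120 million objects (8 x 15 million); the answer itself does not state this.

```python
OBJECTS_PER_NODE = 15_000_000  # approximate limit per Search and Index node
MAX_NODES_PER_ENGINE = 9       # maximum nodes in one Content Indexing Engine


def engine_capacity(nodes: int) -> int:
    """Approximate object capacity of an engine with the given node count.

    Assumption (not stated in the FAQ): with more than one node, the Admin
    node's share is excluded, so nine nodes give 8 * 15 million objects.
    """
    if not 1 <= nodes <= MAX_NODES_PER_ENGINE:
        raise ValueError(f"an engine supports 1 to {MAX_NODES_PER_ENGINE} nodes")
    indexing_nodes = nodes if nodes == 1 else nodes - 1
    return indexing_nodes * OBJECTS_PER_NODE


def nodes_needed(expected_objects: int) -> int:
    """Smallest node count whose approximate capacity covers the expected volume."""
    for nodes in range(1, MAX_NODES_PER_ENGINE + 1):
        if engine_capacity(nodes) >= expected_objects:
            return nodes
    raise ValueError("expected volume exceeds a single nine-node engine")
```

Sizing for the expected final volume up front matches the advice to have all expected nodes in place when indexing starts.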

Q. How much space does the content indexing require?

A. The dynamic volume of a content indexing process consists of the SQL-compatible
document database (fixml) and its associated index, which are referred to together as
the index. The document database can be backed up/restored and the associated index
regenerated. The disk space consumed by the database and index depends on
several factors:
• Choice of Lemmatization (Full or Slim)
• Size/word density of the objects being indexed
• Retention of the database/index
With respect to object size/word density on an average uncompressed file
server, Office documents generally require 5 - 15% of their file size in content index
space. Emails can require 50 - 100+% of their original size, primarily because
they are heavily text based and our storage of email in archive form is highly
compressed.
At times, 2x+ the size of the index will be required during the normal
(preconfigured/regularly scheduled) maintenance of the indexes. This allows the content
index to remain searchable while the maintenance processes are performed.
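As a worked example of the ratios above, the following sketch estimates a low/high range for index space and the peak space needed during maintenance. The names are hypothetical, the email upper bound is open-ended in the answer ("100+%") and is taken as 100% here, and lemmatization profile and retention are ignored, so treat the result as a rough planning figure.

```python
# Approximate index-space ratios quoted above, as percentages of source size.
OFFICE_PCT = (5, 15)    # Office documents: 5 - 15%
EMAIL_PCT = (50, 100)   # Emails: 50 - 100+% (upper bound taken as 100% here)
MAINTENANCE_FACTOR = 2  # 2x+ the index size may be needed during maintenance


def estimate_index_gb(office_gb: float, email_gb: float) -> tuple[float, float]:
    """Return a (low, high) estimate of index size in GB."""
    low = office_gb * OFFICE_PCT[0] / 100 + email_gb * EMAIL_PCT[0] / 100
    high = office_gb * OFFICE_PCT[1] / 100 + email_gb * EMAIL_PCT[1] / 100
    return low, high


def peak_disk_gb(index_gb: float) -> float:
    """Disk space to provision so the index stays searchable during maintenance."""
    return index_gb * MAINTENANCE_FACTOR
```

For 1000 GB of Office documents and 200 GB of email, this gives a range of 150 to 350 GB, and about 700 GB of disk at the high end once maintenance headroom is included.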

Q. What’s the difference between Full and Slim Lemmatization?

A. The choice of the Full or Slim Lemmatization profile is made during installation of the
Content Indexing Engine software. Full Lemmatization enables dynamic teasers and
expansion of key word searches to include plural and tense variation. A dynamic teaser
is the relevant surrounding text to the search words/phrase that is displayed for matched
objects. The alternative provided with Slim Lemmatization is a static teaser made up of
text from the beginning of the object – regardless of where in the object the matched
words/phrase was found. With the key word expansion feature of Full Lemmatization,
a search for run also finds ran and running, and a search for car also finds the
plural form cars, but not career.
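The difference between the two profiles can be illustrated with a toy model. This is not how the Content Indexing Engine is implemented: the lemma table is hand-written for the examples in the answer, and all function names and window sizes are hypothetical.

```python
# Hand-written toy lemma table; a real lemmatizer uses a linguistic dictionary.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "cars": "car"}


def lemma(word: str) -> str:
    """Map a word to its base form, ignoring case and trailing punctuation."""
    cleaned = word.lower().strip(".,;:!?")
    return LEMMAS.get(cleaned, cleaned)


def matches(query: str, text: str) -> bool:
    """Full-profile style match: compare lemmas, not substrings."""
    target = lemma(query)
    return any(lemma(token) == target for token in text.split())


def static_teaser(text: str, words: int = 5) -> str:
    """Slim-profile style teaser: always the beginning of the object."""
    return " ".join(text.split()[:words])


def dynamic_teaser(text: str, query: str, radius: int = 2) -> str:
    """Full-profile style teaser: the text surrounding the first match."""
    tokens = text.split()
    target = lemma(query)
    for i, token in enumerate(tokens):
        if lemma(token) == target:
            return " ".join(tokens[max(0, i - radius): i + radius + 1])
    return static_teaser(text)  # no match: fall back to the static form
```

Because matching compares lemmas rather than substrings, a search for car matches cars but not career, as the answer describes.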

It is not possible to change the lemmatization mode on a per query basis. Changing the
lemmatization profile requires re-indexing of all documents.
Initial tests indicate that choosing the Slim profile when setting up content indexing
reduces the index size by 35-40%.

Q. What’s the difference between Offline and Online Content
Indexing?

A. The primary differences between Offline and Online Content Indexing are where the
data is sourced from and the types of data that can be indexed. Online Content Indexing
is sourced from a Windows Client host and only includes File System objects. Offline
Content Indexing is sourced from a protected storage policy copy and includes Archived
and Backed up Windows File System files, Exchange Mailbox messages, SharePoint
Documents, and Archived Domino Mail Server messages. Note that Domino Mail Server
messages that have only been backed up, and any other document types, cannot be content
indexed.

Q. How do I enable Offline Content Indexing?

A. To enable Offline Content Indexing, a Content Indexing Engine must be installed first.
The source data for offline content indexing is located on a storage policy copy. Open
the storage policy’s properties dialog and select the Content Indexing tab. Check the
option to Enable Content Indexing and select the Content Indexing Engine to use.
You also need to go to each associated Client Computer’s Properties dialog and select
the Advanced Tab. Check the option to Enable Content Indexing. Checking the client’s
Enable Content Indexing option consumes a client content indexing license.
Storage policy data from the Client’s supported and associated subclients will not be
eligible for content indexing until all the above steps are done.
Default settings on the Storage Policy’s Content Indexing tab will include all subclient
data on the Primary copy with no filters and retain the index in accordance with the data
retention policy for the Primary copy. Change the configuration options for content, filters
and retention as desired.
Offline Content Indexing jobs must be scheduled separately. They operate similarly to
Auxiliary copy jobs in that parallel streams can be used. Previously indexed data is not
re-indexed unless manually selected for re-indexing. Indexes are, by default, retained as long
as there is a copy of the indexed object in the storage policy. Content Index retention
rules can be set to prune indexes before the source object is pruned.
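The prerequisites described in this answer can be summarized as a checklist. The sketch below is a hypothetical model for illustration: the class and field names are invented here and do not correspond to any CommVault API. It only captures the rule that client data is not eligible until every step is done.

```python
from dataclasses import dataclass


@dataclass
class OfflineIndexingSetup:
    """Hypothetical model of the setup steps described above."""
    engine_installed: bool        # a Content Indexing Engine is installed
    storage_policy_enabled: bool  # Enable Content Indexing on the storage policy's Content Indexing tab
    client_enabled: bool          # Enable Content Indexing on the client's Advanced tab (consumes a license)


def missing_steps(setup: OfflineIndexingSetup) -> list[str]:
    """List the steps that still block offline content indexing."""
    steps = []
    if not setup.engine_installed:
        steps.append("install a Content Indexing Engine")
    if not setup.storage_policy_enabled:
        steps.append("enable content indexing on the storage policy")
    if not setup.client_enabled:
        steps.append("enable content indexing on the client")
    return steps


def is_eligible(setup: OfflineIndexingSetup) -> bool:
    """Data is eligible only when no step is missing."""
    return not missing_steps(setup)
```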

Q. How do I enable Online Content Indexing?

A. Online Content Indexing capability is provided via an iDataAgent installed on the host
computer. The Online Content Indexing iDataAgent requires the host computer to also
have a File System iDataAgent, the Data Classification Enabler, and .NET
Framework 2.0 installed. Once it is installed, the user can use the default subclient or create
additional subclients to define the content to be indexed, the filters to apply, and the
Content Indexing Engine to use.
Online Content Indexing jobs must be scheduled. Full or Incremental jobs are supported.
With an Incremental Content Indexing job, only data that has not been previously
indexed will be included in the job.
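The difference between the two job types can be sketched as a selection rule. This is an illustration with hypothetical names, and it assumes that a Full job covers all of the subclient's content regardless of what has been indexed before, which the answer implies but does not state.

```python
def select_objects(job_type: str, source_objects: list[str], indexed: set[str]) -> list[str]:
    """Return the objects a job would index, in sorted order.

    job_type is "full" or "incremental" (labels chosen for this sketch).
    """
    if job_type == "full":
        # Assumption: a Full job includes everything defined in the subclient.
        return sorted(source_objects)
    if job_type == "incremental":
        # Only data that has not been previously indexed is included.
        return sorted(obj for obj in source_objects if obj not in indexed)
    raise ValueError(f"unknown job type: {job_type!r}")
```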

Q. How are the indexes retained?

A. Online content indexes are self-maintained based on the existence of the source
object. If the source object is deleted, the corresponding index will be deleted with the
next content indexing job.
Offline content indexes are retained, by default, the same as the oldest retained copy of
the object in the storage policy. Optionally, you can set the index retention to a specified
number of days. Setting index retention shorter than the object’s retention in protected
storage might be done to save index space. Retaining the index longer would allow later
searches to locate the object in the index, but the object would not exist for viewing.
Since Offline jobs can be re-indexed at any time, the decision to use longer retention or
re-index as necessary depends on the type, nature, and frequency of searches.
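The retention rules above can be modeled roughly as follows. The function names are hypothetical, and the sketch assumes that a day-based index retention is counted from the date the object was indexed, which the answer does not specify.

```python
from datetime import date, timedelta
from typing import Optional


def online_index_after_job(index_entries: set[str], existing_sources: set[str]) -> set[str]:
    """Online: entries whose source object was deleted are dropped by the next job."""
    return index_entries & existing_sources


def offline_index_expiry(indexed_on: date, object_expiry: date,
                         retention_days: Optional[int] = None) -> date:
    """Offline: by default the index lives as long as the oldest retained copy
    of the object; optionally it is kept for a fixed number of days instead.

    Assumption: retention_days is counted from the indexing date.
    """
    if retention_days is None:
        return object_expiry
    return indexed_on + timedelta(days=retention_days)


def found_but_not_viewable(on: date, index_expiry: date, object_expiry: date) -> bool:
    """True when a search can still locate the object but no copy remains to view."""
    return object_expiry < on <= index_expiry
```

A shorter index retention saves index space, while a longer one produces the period that `found_but_not_viewable` detects.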

Q. How does encryption impact Content Indexing?

A. Data that is software-encrypted by the file system or other third-party applications on
the host, or data that is encrypted by CommVault® software using the Pass Phrase
option cannot be content indexed. Objects that require decryption will, of course, take
longer to index.

Q. How does compression impact Content Indexing?

A. The content indexing process will expand compressed files (if the compressed size is
less than the max acceptable file size) and index their content. Objects that require
decompression will, of course, take longer to index.

Q. Are there tunable performance parameters for the Content
Indexing Engine?

A. The Content Indexing Engine is pre-configured for optimal performance. There are no
user-configurable parameters that impact performance. Better performance can be achieved
by having sufficient CPU and RAM resources and high-performance disk and data
path hardware. The Online Books specify the minimum system requirements; however,
acceptable performance in higher demand environments may not be achievable with just the
minimum.

©1999-2008 CommVault Systems, Inc. All rights reserved. CommVault, CommVault and logo, the “CV” logo, CommVault Systems, Solving Forward, SIM, Singular Information Management, Simpana, CommVault Galaxy, Unified Data Management,
QiNetix, Quick Recovery, QR, CommNet, GridStor, Vault Tracker, InnerVault, QuickSnap, QSnap, Recovery Director, CommServe and CommCell and are trademarks or registered trademarks of CommVault Systems, Inc. All other third party brands, products, service names, trademarks, or registered service marks are the property of and used to identify the products or services of their respective owners. All specifications are subject to change without notice.