<update (9/21/2009): This post has been updated to include additional information that was accidentally left out of the original post>
<update 2 (10/15/2009): The link to part 3 is now added in the first part of the overview below>
Overview
In part 1 of this series, we covered what is new in Exchange 2007 search. Part 2 will focus on Content Indexing. Part 3 can be found here.
As soon as the Exchange Search services are started, the MonitorAndUpdateMDBList worker thread begins determining the status of each database on the server. That status is stored in memory, and the process runs about every 30 seconds to keep it up to date. If a database is online and enabled for indexing, a catalog object is created in memory that holds one of three values: New, Crawling, or Notification. A value of New initiates the creation of a property store and catalog for the database; once these are created, the value is changed to Crawling. The Crawling value identifies databases that are performing their initial crawl. A Notification value identifies a database that has finished its initial crawl and is ready to process events from the Event History table.
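The three catalog values behave like a small state machine. The following is a minimal Python sketch of that idea only; the names CatalogState and next_state and the two boolean flags are hypothetical and do not correspond to anything in the Exchange code base.

```python
from enum import Enum


class CatalogState(Enum):
    NEW = "New"
    CRAWLING = "Crawling"
    NOTIFICATION = "Notification"


def next_state(state, store_and_catalog_created=False, initial_crawl_done=False):
    """Return the value a catalog object holds after one status check.

    Both flags are illustrative stand-ins for whatever the real
    service inspects on its roughly 30-second pass.
    """
    if state is CatalogState.NEW and store_and_catalog_created:
        return CatalogState.CRAWLING
    if state is CatalogState.CRAWLING and initial_crawl_done:
        return CatalogState.NOTIFICATION
    return state  # nothing changed since the last check
```

A catalog only ever moves forward in this sketch: New to Crawling, then Crawling to Notification.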
Crawling - The catalog for a database holds the Crawling value until all mailboxes in that database have been initially indexed. By searching AD, the Content Indexer builds a Mailbox List in memory of all mailboxes in the database. Each mailbox is given one of three values in the Property Store: NotStarted, NormalCrawlInProgress, or Done. At startup, ten worker threads dedicated to crawling are created. Those threads process mailboxes from the Mailbox List and remove a mailbox from the list once the items in the mailbox have been indexed. Once the list is empty and all mailboxes have been indexed, the catalog value for the database changes to Notification. Note that in Exchange 2007, unlike older versions, crawling cannot be scheduled. Crawling is performed on demand. This means that some indexing processes do not run continuously, but rather intermittently.
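The crawl described above is essentially a pool of workers draining a shared list. Here is a rough Python sketch under that assumption; crawl_database and index_mailbox are hypothetical names, and the real service is not implemented this way as far as this post states.

```python
import queue
import threading

CRAWL_THREADS = 10  # the post states ten dedicated crawl threads


def crawl_database(mailboxes, index_mailbox):
    """Crawl every mailbox once and return the per-mailbox status map.

    index_mailbox is a caller-supplied callable standing in for the
    actual indexing work.
    """
    status = {m: "NotStarted" for m in mailboxes}
    pending = queue.Queue()
    for m in mailboxes:
        pending.put(m)

    def worker():
        while True:
            try:
                mailbox = pending.get_nowait()  # take the next mailbox off the list
            except queue.Empty:
                return  # list is empty, this worker is finished
            status[mailbox] = "NormalCrawlInProgress"
            index_mailbox(mailbox)
            status[mailbox] = "Done"

    threads = [threading.Thread(target=worker) for _ in range(CRAWL_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status
```

Once every value in the returned map is Done, the catalog for the database would move from Crawling to Notification.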
Notification Indexing - Once the initial crawl has occurred on a database, a process called the Notification Watcher constantly checks the Event History table for entries to be indexed. Checking one database at a time, the Notification Watcher can read up to 2000 events at a time. Watermarks are used so the Notification Watcher knows where it left off and where to begin next. Events found to be interesting are added to the notification queue to be indexed.
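The watermark mechanism can be pictured as a cursor over numbered events. The sketch below is an assumption-laden illustration in Python: the event dictionaries, the read_batch function, and the is_interesting callback are all hypothetical, and the event list is assumed to be sorted by event number.

```python
BATCH_SIZE = 2000  # the post states up to 2000 events per read


def read_batch(event_history, watermark, is_interesting):
    """Read the next batch of events numbered above the watermark.

    Returns (events to queue for indexing, new watermark). The
    watermark advances past every event read, interesting or not,
    so the next read starts where this one left off.
    """
    batch = [e for e in event_history if e["id"] > watermark][:BATCH_SIZE]
    queued = [e for e in batch if is_interesting(e)]
    new_watermark = batch[-1]["id"] if batch else watermark
    return queued, new_watermark
```

Calling read_batch repeatedly with the returned watermark walks the whole table in batches without reading any event twice.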
When a mailbox is moved to a new database, a process called One-Off Crawling is used to index the mailbox. Once the mailbox has been indexed, it returns to a Notification status and normal notification processing resumes. This notification indexing reduces the disk I/O overhead. See this link for additional information on the effects of Content Indexing on a server:
Configuring, validating and monitoring your Exchange 2007 storage - Content Indexing
http://msexchangeteam.com/archive/2007/01/15/432199.aspx
Indexing can be disabled on a specific mailbox database by using the Exchange Management Shell:
Set-MailboxDatabase <MailboxDatabaseName> -IndexEnabled $false
Components of Content Indexing
The four main components of Content Indexing are Store.exe, MSExchangeSearch, MS Search, and MS Search Filter Daemon. The components work together using Named Pipes, shared memory, and COM/RPC to build the index catalogs and respond to client queries. Below, Figure 2.1 illustrates the components that make up Exchange 2007 Search. Following the diagram is a detailed explanation of each component.
Figure 2.1 Exchange 2007 Search Components
Store.exe
(MsExchangeIS) Microsoft Information Store Service
The Exchange 2007 Information Store contains three important subcomponents used in Content Indexing and Search: the Event History Table, the Property Store, and the Query Processor.
Event History Table - this Jet database table in an information store mailbox database, as its name implies, contains numbered events that are written each time an important event occurs in the store. Many Exchange-related services read through this table looking for events that are "interesting" or important to their specific function, ignoring events that are of no importance. Mailbox items that are new, changed, deleted or moved trigger an event that is added to this table. Such an event lets Exchange Search know there is an item needing to be indexed. The MSExchangeSearch component is responsible for reading this table continuously (about once a millisecond). There is one Event History Table per mailbox database. More information on this process is explained later in this Bulletin.
Property Store - previous versions of search also used a Property Store; however, the Property Store was kept as a separate file from the Information Store database. The Property Store is now a Jet database table in the information store database containing metadata for indexed items; there is one Property Store table per mailbox database. The Property Store contains properties about indexed items that help match entries in the index catalog to objects in the store. MSExchangeSearch uses the Document ID (assigned during indexing) to search the Property Store for the matching Entry ID (FID/MID or Folder ID/Message ID) of the document. Search also checks the Property Store for the current indexing status of the document.
Query Processor - The information store utilizes MSSearch 3.0 for queries in Exchange 2007. Previous versions of Exchange used MSSearch 2.0. When a client sends a search request to the store, the Query Processor is initiated. The Query Processor builds the search request and works with MSSearch to find the requested data and return it to the client. This process is explained more completely in the third post of this blog series.
MSExchangeSearch
(Microsoft.Exchange.Search.ExSearch.exe) Microsoft Exchange Search Indexer
One of the four main components, Microsoft Exchange Search Indexer (MSExchangeSearch), is responsible for all index-enabled mailbox stores on a server. Any time a message is modified, created, deleted, or moved, an event is created in the Event History Table. MSExchangeSearch reads the Event History Table. Events that MSExchangeSearch finds interesting are added to a queue to be processed by the Indexer. Events are not removed from this queue until MSSearch reports that they have been successfully indexed. This all happens extremely quickly, which is why the catalog is never more than a few minutes out of date. In addition, MSExchangeSearch is responsible for writing and maintaining the metadata in the Property Store for the indexed items: Document ID, Entry ID and the indexed state of the item. If a database catalog is deleted or deemed out of date, the MSExchangeSearch service is responsible for initiating the new crawl of the database.
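The rule that events stay queued until indexing is confirmed can be illustrated with a small Python sketch. PendingEvents and its methods are hypothetical names chosen for this example; they are not part of MSExchangeSearch.

```python
class PendingEvents:
    """Holds events until the indexer confirms they were indexed."""

    def __init__(self):
        self._pending = {}  # event id -> event payload

    def enqueue(self, event_id, payload):
        self._pending[event_id] = payload

    def acknowledge(self, event_id):
        """Drop an event only after a success report for it arrives."""
        self._pending.pop(event_id, None)

    def outstanding(self):
        """Event ids still waiting for confirmation, in ascending order."""
        return sorted(self._pending)
```

Because nothing is removed on hand-off, an event whose indexing never gets confirmed is still available to be retried.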
msftesql-Exchange
Microsoft Search (Exchange) - MSSearch 3.0
Another main component of Content Indexing, msftesql-Exchange is responsible for reading and writing to the index catalog. The catalog files and directory are created during the initial crawl, in the same location as the database files. The path of the catalog cannot be changed; however, moving a mailbox database will move the catalog. Restoring a database from backup does not restore the catalog; instead, a new index crawl is initiated on a new catalog. Other responsibilities of the msftesql-Exchange service are performing admin functions, executing full-text queries from the store's query engine and managing the Filter Daemon.
The ResetSearchIndex.ps1 script can force a rebuild of the catalog for a specific database on your server: ResetSearchIndex.ps1 [-force] <dbname> [<dbname2>]. This process will remove and recreate the index catalog. The index catalog files can be considered expendable: if the catalog is found to be more than seven days out of date, it will be discarded and a new crawl and catalog will be initialized. Corruption, accidental deletion, or simply troubleshooting search problems are some of the reasons to manually rebuild the Index Catalog.
How to Rebuild the Full-Text Index Catalog
http://technet.microsoft.com/en-us/library/aa995966.aspx
Rebuilding the catalog will resolve issues of corruption with Index Catalog files, as noted in this KB article: The Outlook Web Access search function does not work for some users in Exchange 2007
http://support.microsoft.com/kb/945077
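The seven-day rule for discarding a catalog amounts to a simple age check. This Python sketch assumes that "out of date" is measured from the time the catalog was last brought current, which the post does not specify; needs_recrawl and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

MAX_CATALOG_AGE = timedelta(days=7)  # threshold stated in the post


def needs_recrawl(last_current, now):
    """True if the catalog is more than seven days out of date.

    last_current is an assumed timestamp for when the catalog last
    matched the database.
    """
    return now - last_current > MAX_CATALOG_AGE
```

When the check is true, the catalog would be discarded and a new crawl started, which is the same outcome ResetSearchIndex.ps1 forces manually.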
During server startup, the msftesql-Exchange service is set to manual and MSExchangeSearch is set to automatic. MSExchangeSearch cannot start until msftesql-Exchange starts. Msftesql-Exchange spawns msftefd.exe. The chart below shows the relationships of the three services and processes. Note that none of the three depends on the Microsoft Information Store Service, and the Microsoft Information Store Service does not depend on any of the three.
Service Name | Startup Type | Display Name | Depends on |
Msftesql-Exchange | Manual | Microsoft Search (Exchange) | Remote Procedure Call (RPC) |
MSExchangeSearch | Automatic | Microsoft Exchange Search Indexer | Msftesql-Exchange, MSExchangeADTopology |
Msftefd.exe (process) | Spawned by Msftesql-Exchange* | | Msftesql-Exchange |
Note: The Msftefd.exe process is instantiated and terminated by MSFTESQL-Exchange on an as-needed basis. This process is instantiated for both crawling and index maintenance and terminated when it is idle for a specific time.
Msftefd.exe
Microsoft Search Filter Daemon
The Filter Daemon is responsible for running through the words and character streams and applying filters and word breakers in the indexing process. The actual process is as follows: after all the data from the item is streamed from the store to the Filter Daemon, the content is passed through the filters and word breakers. The Filter Daemon breaks the textual stream into words, removes noise words (such as "the" and "and") and passes the words to be indexed to MSSearch 3.0 to create the actual index entries in the catalog.
Filters are used to extract the text from specific types of documents: html, doc, xml, xls, pdf, and so on. In the registry under HKLM\Software\Microsoft\Exchange\MSSearch\Filters there will be a list of filters that the server is able to use; see picture below.
Office 2007 documents (docx, xlsx, etc.) are not indexed by default; adding the Filter Pack IFilters allows these attachments to be indexed. If an extension is not listed in the registry, we simply skip the attachment and index the rest of the message. You can enable additional file types to be indexed by registering Filter Pack IFilters. For further information, see this article:
944516 How to register Filter Pack IFilters with Exchange Server 2007
http://support.microsoft.com/default.aspx?scid=kb;EN-US;944516
Word Breakers understand the rules of language and are used to convert strings of characters to words, and words to word tokens that are then passed to the msftesql-Exchange service to be indexed and written to the catalog. MAPI.net is used in Exchange 2007 to expose data in the Information Store to the Filter Daemon for indexing.
During the indexing process, if any part of the message fails to be indexed, the entire message fails to be indexed. For example, if we fail to open and index a docx attachment while indexing a message, the entire message is skipped and we will not index the body of the message. This is different from Exchange 2003. However, if an IFilter for a specific attachment type is not listed in the registry, we will skip indexing of that attachment type and the message body will still be indexed.
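The difference between a missing filter and a failing filter is easy to miss, so here is a hedged Python sketch of the rule. The function extract_message_text and the filters dictionary (standing in for the registry list of IFilters) are hypothetical; the real Filter Daemon works through COM IFilter interfaces, not Python callables.

```python
import os


def extract_message_text(body, attachments, filters):
    """Return the text to index, or None if the whole message is skipped.

    attachments: list of (filename, raw_content) pairs.
    filters: maps a lowercase extension such as ".doc" to a callable
             that returns extracted text or raises on failure.
    """
    parts = [body]
    for filename, content in attachments:
        ext = os.path.splitext(filename)[1].lower()
        extract = filters.get(ext)
        if extract is None:
            continue  # no filter registered: skip only this attachment
        try:
            parts.append(extract(content))
        except Exception:
            return None  # a registered filter failed: skip the entire message
    return "\n".join(parts)
```

In this sketch an unregistered extension costs only the attachment, while a registered filter that throws costs the whole message, body included.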
Noise Words in Exchange 2007
The Filter Daemon also removes noise words. The query processor has a mechanism that discards from the query commonly occurring words that do not factor into the search. These words are called noise words. Noise words are listed in the locale-specific noise word files on the server. For example, in the English (US) locale, words such as "a," "and," "is," and "the" that appear in the English noise word file (if one exists) can be left out of the full-text index, since they are empirically known to be useless to a search. The query processor determines the noise word file to use based on the locale of the caller making the query. The query processor removes any of these words from the restriction prior to optimization, since they would not be found in the full-text index. Noise words are therefore removed both from the index, by the Filter Daemon during creation of the index, and from the search, by the Query Processor.
Note: by default, there are no noise word files in Exchange 2007. However, there is the capability to create and use noise word files in Exchange 2007. Noisexxx.txt is the name of the file, where xxx depends on the Language ID. For example, noiseenu.txt would be for English (US) and noisefra.txt would be for French.
The noise word file, when present, should be located in the Bin\FTERef subdirectory of the Exchange install directory, with a name following the pattern "noisexxx.txt". For example, if you install Exchange 2007 on the C: drive and use English (US) language for your noise word file, it would be located here with the following name:
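To see why noise words have to be removed on both sides, consider this minimal Python sketch. The regular-expression word breaker is a crude stand-in for the language-aware word breakers Exchange uses, and all function names here are hypothetical.

```python
import re


def break_words(text):
    """Very rough stand-in for a word breaker: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_noise(words, noise_words):
    return [w for w in words if w not in noise_words]


def build_index(documents, noise_words):
    """Map each non-noise word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in strip_noise(break_words(text), noise_words):
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query, noise_words):
    """Return ids of documents containing every non-noise query word."""
    terms = strip_noise(break_words(query), noise_words)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))
```

If the query side did not strip "the", a search for "the budget" would require a term that was never indexed and would match nothing.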
C:\Program Files\Microsoft\Exchange Server\Bin\FTERef\noiseenu.txt
The complete list of language codes for noise word file names is shown below.
{ 0x0804, L"CHS" }, // Simplified Chinese (PRC)
{ 0x0404, L"CHT" }, // Traditional Chinese (Taiwan)
{ 0x0406, L"DAN" }, // Danish
{ 0x0407, L"DEU" }, // German
{ 0x0409, L"ENU" }, // English (US)
{ 0x0809, L"ENG" }, // English (UK)
{ 0x0C0A, L"ESN" }, // Spanish
{ 0x040C, L"FRA" }, // French
{ 0x0410, L"ITA" }, // Italian
{ 0x0411, L"JPN" }, // Japanese
{ 0x0412, L"KOR" }, // Korean
{ 0x0413, L"NLD" }, // Dutch
{ 0x0415, L"PLK" }, // Polish
{ 0x0416, L"PTB" }, // Portuguese
{ 0x0419, L"RUS" }, // Russian
{ 0x041D, L"SVE" }, // Swedish
{ 0x041E, L"THA" }, // Thai
{ 0x041F, L"TRK" } // Turkish
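Putting the naming pattern and the table together, the expected file location for a locale can be computed as in the following Python sketch. Only a few entries from the table are copied in, the default directory is the example install path given earlier, and noise_file_path is a hypothetical helper, not an Exchange API.

```python
import ntpath  # Windows path rules, importable on any platform

# A few entries copied from the table above (locale ID -> language code).
LANGUAGE_CODES = {
    0x0409: "ENU",  # English (US)
    0x0809: "ENG",  # English (UK)
    0x040C: "FRA",  # French
    0x0407: "DEU",  # German
}


def noise_file_path(lcid, bin_dir=r"C:\Program Files\Microsoft\Exchange Server\Bin"):
    """Build the noise word file path for a locale ID.

    Raises KeyError for locale IDs missing from LANGUAGE_CODES.
    """
    code = LANGUAGE_CODES[lcid]
    return ntpath.join(bin_dir, "FTERef", "noise%s.txt" % code.lower())
```

For example, noise_file_path(0x040C) yields a path ending in noisefra.txt under the same FTERef directory.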
Content Indexing and Exchange 2007 High Availability
Exchange 2007 offers three high availability options: Single Copy Cluster (SCC), Cluster Continuous Replication (CCR), and Local Continuous Replication (LCR).
- SCC mailbox servers share a single instance of a mailbox database and index catalog. There is no change in the Content Index process for an SCC.
- With CCR mailbox servers there are two instances of a mailbox database: one active and one passive. An instance of the content indexer on each node creates a unique catalog for each database. Each catalog has a unique GUID held in the database that matches it to the content indexer on the node that created it. When failing over, the second catalog will always be used with the second database, and the first catalog with the first database. One current limitation is that there is no way to detect how up to date or how healthy the catalog on the passive node is. The MSExchangeSearch process on the passive node continuously updates the catalog so that it can be used for failover at any time.
- In an LCR implementation, there is only a single copy of the content index catalog. When the offline database on an LCR server is activated, the original catalog is not automatically moved over. It can be manually copied to the new active database location, in which case the index will be up to date. If the catalog is not copied over, a new catalog will be created and a full crawl will begin.
Summary
In summary, Content Indexing in Exchange 2007 checks the indexing status of each database about every 30 seconds. Crawling in Exchange 2007 is performed on demand rather than on a schedule. Notifications let the indexing process know what has already been indexed and what is queued to be indexed next.
The components of Exchange 2007 Content Indexing include the Microsoft Information Store, the Microsoft Exchange Search Indexer, Microsoft Search (Exchange) (MSSearch 3.0), and Microsoft Search Filter Daemon. The Filter Daemon uses Filters and Word Breakers to create the Index.
The Filter Daemon also removes Noise words during crawling to create the Index. Exchange 2007 includes the capability to add Noise word files. By default, there are no Noise word files provided with Exchange 2007.
SCC mailbox servers share a single instance of a mailbox database and catalog, while CCR mailbox servers contain one mailbox database associated with its own catalog and a copy of the mailbox database associated with its own unique catalog. In an LCR implementation, there is only a single copy of the catalog.
-- Bob Want and Jack French
You Had Me at EHLO.