Portal Content search

 


 


Contents

  1. Overview
  2. Create the document collection:
  3. Document Search portlet
  4. Install the Document Search portlet
  5. Install the Taxonomy Manager portlet
  6. Additional features and options in the Manage Collections portlet

 

Overview

WebSphere Portal Version 5.0.2 provides a search engine, the key features of which include:

  1. Crawl multiple sites
  2. Configure periodic crawls per site
  3. Free text search
  4. Trailing wild card character
  5. Browsing of collection
  6. Categorization of incoming documents using either a static taxonomy or simple rules.
  7. Approve documents before they are added to the collection and the index
  8. Allow editing of document metadata
  9. Start and stop crawls manually
  10. Enhanced monitoring of document collection process
  11. Multiple language support
  12. Socks and proxy server for indexing external sites in the portlet
  13. Summarizer

Note that anonymous users cannot search when the Search Center portlet is deployed on a public page. Fix PK09511 extends this functionality to unauthenticated users; download it to resolve the problem.

 

Configure Search

  1. Define the content that you want to make available for search.

  2. Define the properties of the full text index.

  3. Select global acceptance of documents returned by a crawl by enabling the option...

    Add all documents to collection automatically

    ...or accept documents individually after a crawl by clicking the option...

    Pending documents

    ...and selecting desired documents.

  4. Install the Document Search user portlet.

 

Summarizer

The summarizer is a facility for summarizing Web pages by extracting their most salient sentences.

The summarizer can produce summaries for languages with an associated stemmer program. The summarizer uses stems as the base forms for words, as opposed to the lemma forms used by summarizers which have dictionaries.

A stem is generally seen as the morphological root of an inflected (or sometimes derived) word form. Stemming is mostly used in Information Retrieval to refer to approaches that strip off suffixes (or what looks like suffixes) and return the remainder as stem. For example in Dutch, the strings "vangen", "vangt" and "gevangen" should be attributed the stem "vang".

A lemma is something quite similar, but still slightly different: Lemmatisation takes a word and returns its baseform or citation form (the canonical dictionary entry form). This means that the same type (word form) can be assigned different lemmas, e.g. "aaltje" (small eel) with lemma "aal" (eel) vs. "aaltje" (sort of worm) with lemma "aaltje" (sort of worm).
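The difference between the two base forms can be illustrated with a small Python sketch. It is a toy and not the stemmer used by the portal: the function naive_stem and its affix rules are assumptions chosen only to cover the Dutch examples above, and the lemma table is a hypothetical two-entry dictionary.

```python
# Toy illustration only; the affix rules and the lemma table are hypothetical.

def naive_stem(word: str) -> str:
    """Strip a leading "ge" and a trailing "en" or "t" if enough text remains."""
    if word.startswith("ge") and len(word) > 4:
        word = word[2:]
    for suffix in ("en", "t"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A lemmatizer needs a dictionary, and the sense of the word decides the lemma.
LEMMAS = {
    ("aaltje", "small eel"): "aal",
    ("aaltje", "sort of worm"): "aaltje",
}


def lemma(word: str, sense: str) -> str:
    """Look up the citation form; fall back to the word itself if unknown."""
    return LEMMAS.get((word, sense), word)


stems = {w: naive_stem(w) for w in ("vangen", "vangt", "gevangen")}
```

The stemmer needs no dictionary and maps all three word forms to the same string, whereas the lemma lookup returns different results for the same word form depending on its sense.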

The summarizer technology is provided as a separate component for use in portlet application development, for example with an integrated search service.

Portal Content search can index content stored in different languages and make it available for search. It uses the Unicode setting of the source content to crawl and index content for search. It supplies a choice of tokenizers selectable by administrators: N-gram indexing and Linguistic indexing.

N-grams are sequences of n consecutive characters in a document. They are generated from a document by sliding a "window" across the text of the document, moving it by one character at a time. N-grams have several advantages over words for use in indexing. First, they are language independent, therefore mixed text can be indexed easily. They are also useful for Asian languages in which word tokenization is more difficult, for example Chinese, Japanese, Korean, and Thai.

Linguistic indexing is based on a morphological analyzer that reduces terms to their base form. It can be usefully applied in most situations when indexing sources with both English and non-English content.
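The sliding window used to generate N-grams can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not the portal's tokenizer; the function name char_ngrams is an assumption.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all sequences of n consecutive characters in text."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n across the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = char_ngrams("portal", 3)
```

Because the window works on characters and not on words, the same function applies unchanged to text in languages that do not separate words with spaces.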

 

Predefined static taxonomy and categorizer

The WebSphere Portal Categorization Facility allows categorization of documents in any of over 2,300 subjects. These subjects are grouped in the following main business category areas:

Portal users can use the Categorization Facility to build applications that automatically determine the subject of documents which fall within any of these areas. It can evaluate and categorize documents in English, French, Italian, and German.

The portal Categorization Facility consists of two major components, a Categorizer and a Taxonomy Manager. Custom categories can be created by..

The categorizer looks for an exact match, including capitalization.

 

Product Name Categories

For a product name category you can choose any word or phrase, but you would most commonly use the names of your company's products or services. You create one category for each product or group of products. For example, you can create a new category named "WebSphere Portal" using the WebSphere Portal Taxonomy Manager. By default this creates a model for that category consisting of the phrase "WebSphere Portal". The categorizer then looks for occurrences of that phrase in all documents, and counts the number of such occurrences. The categorizer multiplies the number of occurrences with the weight assigned to that phrase to compute a score. If the calculated score is greater than or equal to the current value of MinUserCatScore as described in the list of parameters below, then the categorizer reports that the document belongs to that category. A given document can belong to more than one Product Name Category.
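The scoring described above can be sketched as follows. This is a minimal illustration, not the categorizer's implementation: the function names are assumptions, and the defaults of 0.1 for the weight and 0.20 for MinUserCatScore are taken from the parameter descriptions in this document.

```python
MIN_USER_CAT_SCORE = 0.20  # default MinUserCatScore
DEFAULT_WEIGHT = 0.1       # default score of a newly created product name entry


def product_name_score(text: str, phrase: str, weight: float = DEFAULT_WEIGHT) -> float:
    """Multiply the number of exact occurrences of phrase by its weight."""
    # str.count is case-sensitive, in line with the exact-match behavior.
    return text.count(phrase) * weight


def belongs_to_category(text: str, phrase: str,
                        weight: float = DEFAULT_WEIGHT,
                        threshold: float = MIN_USER_CAT_SCORE) -> bool:
    """Report membership when the score reaches the threshold."""
    return product_name_score(text, phrase, weight) >= threshold
```

With these defaults, a document that mentions "WebSphere Portal" once scores 0.1 and is not assigned to the category, while a document that mentions it twice reaches the threshold.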

 

Synonyms

You can assign any number of synonyms to the standard set of categories shipped with WebSphere Portal or your product name categories. You can also assign synonyms to interior nodes of the taxonomy. Each synonym is used to help the categorizer identify other instances of that category. Common synonyms can be other spellings or capitalization patterns. They can also just be other phrases that signify a particular category. For example, if the documents you categorize often spell the name of a product in all capital letters, you create a synonym such as WEBSPHERE PORTAL.

The best way to decide whether you need a synonym is to examine your documents to see what forms of the category name are used in practice. At the time you create the synonym, you are prompted to assign a weight to it. The categorizer multiplies the number of occurrences of a synonym in a particular document with that weight to calculate a score, and adds it to the score for that category.

Example: A document is to be categorized. The categorizer reports the two top categories as "Drinking Water Protection" and "Drinking Water Treatment" with scores of 0.24 and 0.25, respectively. You assign "watershed protection" with a weight of 0.05 as a synonym to "Drinking Water Protection". If this new synonym is found once in this document, this alters the scores to 0.29 and 0.25, respectively. Consequently, the "best" answer from the categorizer is now "Drinking Water Protection."
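The arithmetic of this example can be reproduced with a short Python sketch. The function name adjusted_score is an assumption; the scores and the weight are the ones given in the example.

```python
def adjusted_score(base_score: float, synonym_hits: int, synonym_weight: float) -> float:
    """Add the synonym contribution (occurrences times weight) to the category score."""
    return base_score + synonym_hits * synonym_weight


scores = {
    # "watershed protection" occurs once with a weight of 0.05.
    "Drinking Water Protection": adjusted_score(0.24, 1, 0.05),
    # No synonym was assigned to this category, so its score is unchanged.
    "Drinking Water Treatment": adjusted_score(0.25, 0, 0.05),
}
best = max(scores, key=scores.get)
```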

If you find that a category does not find all desired documents on a particular topic, add synonyms. You can assign the desired weight to each synonym. However, in general you may find it best to use a weight of no more than 50% of the MinUserCatScore for synonyms to product name categories, and no more than 50% of the MinCatCos for synonyms to the standard WebSphere Portal categories. This ensures that a document must contain at least two mentions of a synonym to be categorized as belonging to that category.
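For a product name category, where the score comes only from counted occurrences, the number of mentions required for a given weight can be estimated as in the sketch below. The function name mentions_needed is an assumption, and the sketch does not cover the standard categories, where the synonym score is added to an existing category score.

```python
import math
from decimal import Decimal


def mentions_needed(threshold: str, weight: str) -> int:
    """Smallest number of occurrences whose total weight reaches the threshold."""
    # Decimal avoids binary floating-point rounding in the division.
    return math.ceil(Decimal(threshold) / Decimal(weight))
```

With the default MinUserCatScore of 0.20, a weight of 0.10 (50% of the threshold) requires two mentions, and a weight of 0.05 requires four.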

 

Categorizer Parameters

The categorizer has a number of adjustable parameters. They can be set to achieve various results. The parameters are controlled by entries in the file ModelCategorizer.properties . You find this file in the following location:

  wp_root/shared/app/eureka/resources/LL/CategorizerModel-yyyy-mmm-dd-LL-wps.zip

where LL indicates the language code, such as fr, en, it, or de. An example is: CategorizerModel-2003-Jul-10-en-wps.zip.

The settings in the file supplied with the portal are configured with values for the best general usage. However, advanced administrators may decide to modify them. If you want to modify the properties file, extract it from the ZIP file and modify it. Then leave the properties file in the same directory where the ZIP file is. You do not need to replace the properties file in the ZIP file with the new one.
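The extraction step can be scripted, for example with Python's zipfile module. This is a sketch under stated assumptions: the function name extract_properties is hypothetical, and because the location of the properties file inside the archive is not described here, the sketch searches the archive by file name.

```python
import zipfile
from pathlib import Path

PROPERTIES_NAME = "ModelCategorizer.properties"


def extract_properties(zip_path) -> Path:
    """Copy the properties file out of the ZIP into the ZIP's own directory."""
    zip_path = Path(zip_path)
    target = zip_path.parent / PROPERTIES_NAME
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the file name, whatever directory it has inside the archive.
        for member in archive.namelist():
            if Path(member).name == PROPERTIES_NAME:
                target.write_bytes(archive.read(member))
                return target
    raise FileNotFoundError(f"{PROPERTIES_NAME} not found in {zip_path}")
```

The ZIP file itself is left unchanged; only the extracted copy next to it is meant to be edited.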

The default settings for the parameters in the properties file are as follows:

Super category threshold

MinSuperCatCos = 0.05

Category threshold

MinCatCos = 0.24

Value by which the 2nd and 3rd cosines must be in order to remain part of the result set

SuperCatProximity = 0.04

Minimum score allowed for user categories in the ProperName Categorizer

MinUserCatScore = 0.20

The parameters and their settings are explained in the following:

MinSuperCatCos

This is the super category threshold. The MinSuperCatCos value is a number between 0 and 1. Typical values are between 0.05 and 0.15. The higher the value, the more stringent the categorizer is in determining the super category, or collection of categories, to which the document belongs. For shorter documents or for less professionally written documents, use a value closer to 0.05; for longer and more professional documents, use a higher value. Web pages often tend toward the shorter and less professional side; for those a setting of 0.05 is recommended. In any case, the value should be substantially lower than MinCatCos.

MinCatCos

This is the category threshold. The MinCatCos value is a number between 0 and 1. Typical values are between 0.15 and 0.27. The higher the value is, the more stringent the categorizer is in determining the category to which the document belongs. Typical Web pages categorize best with a value of 0.24; however, short documents may categorize well with a lower value. Values slightly above 0.24 may be appropriate for single-topic documents that are professionally authored and of significant length, that is several hundred words.

SuperCatProximity

This is the value by which the second and third cosines must be in order to remain part of the result set. The SuperCatProximity value is a number between 0 and 1. Typical values are in the range of 0.01 to 0.08. The higher the value is, the more likely the categorizer is to consider a broader set of super categories. Generally, this should be left at the default setting of 0.04.

MinUserCatScore

This is the minimum score allowed for user categories in the ProperName Categorizer. The MinUserCatScore applies to the user created model data as described in the Customization section above. It can have a value between zero ( 0 ) and infinity. The higher the value is, the more stringent the categorizer is in determining the product name category to which the document belongs. A document is assigned to a product name category when the product name score for that category is at or above the MinProperNameEurekaScore. As the default score for each newly created product name category entry is 0.1, the default threshold of 0.2 implies that the Product Name Category must occur at least twice in the document for the document to be scored as belonging to that Product Name Category.

 

WebSphere Portal Taxonomy Manager

The WebSphere Portal Taxonomy Manager portlet helps you manage the pre-defined static taxonomy. Use it to perform a wide range of administrative tasks on the categories that constitute the taxonomy, including the following:

More than one user can use the Taxonomy Manager at a time. However, only one user at a time should use it to change the taxonomy; other users should utilize it only to view the taxonomy. Therefore it is recommended that the portal administrator assigns the editor role for the taxonomy manager to only one user.

 

Taxonomy Manager Portlet panels

The WebSphere Portal Taxonomy Manager portlet consists of a set of view panels. Each of the panels displays a different view of your taxonomy.

Taxonomy Tree panel

This is the most important one of the panels. It displays the current view of your taxonomy. Each node of the taxonomy is displayed on a single line. The line consists of the name of the category displayed as a Web link. By clicking on the link you can display all of the subcategories, if any, of that category. For example, clicking on "Sales, Marketing, and Advertising Industries" displays the subcategories "Advertising," "Marketing" and "Sales." In front of each node in the taxonomy is a colored dot:

  • A green dot means that the node is active, that is all categories of the node are being used.

  • A yellow dot means that the node is collapsed, that is categories under this node are not displayed.

  • A red dot means that the node has been deleted.

Clicking on a node also selects that node for editing tasks as described in the following.

Taxonomy Search Page panel

Search for a particular word or phrase in any part of the taxonomy. You can also search for a category by its category ID. Use this search feature if indexing gives you too few or too many documents with a certain keyword; you can then look up where that keyword occurs in the taxonomy.

Proper Name panel

Display and change the proper names and synonyms associated with each category. It is normally visible by default. It can also be invoked using the Edit Proper Name task.

 

Typical Usage of the Taxonomy portlet

As with other WebSphere Portal portlets, you first log in to the portal. You can then launch the Taxonomy Manager portlets. The exact details of this will vary, depending upon how your company has installed the portlet.

Normally you first load the current taxonomy. Depending on how your portal administrator has set up the portlet, it will probably load your company taxonomy by default. However, if you want to load a different taxonomy, you can do so with the Load Taxonomy action. This will display your taxonomy in the Taxonomy Tree portlet.

 

Set the Default Taxonomy

The administrator can change the default taxonomy by changing the name in the portal.xml file distributed with the taxonomy manager.

 

Giving Edit or Read-Only Permissions

The administrator can control which users can edit the taxonomy by giving edit users access to the Taxonomy Manager portlet and read-only users access to the Taxonomy Viewer portlet.

 

Categorizer Parameters

The administrator should consult the documentation for the Model-based Categorizer to determine how best to set the categorizer parameters. In particular, the settings for some parameters will affect how the user-assigned Weights affect the categorizer results.

 

Manage collections - Create the document collection

To create/update the document collection, go to...

Administration | Portal Settings | Manage Collections

With the Document Collections box, you can...

  1. Create collection.

  2. Select a collection and perform one of the following tasks:

    Delete: Delete the selected document collection.
    Refresh: Manually refresh the selected document collection. The index performs a complete re-crawl on all the sites of the document collection.
    Import or export: Import or export the selected document collection by using the portal search XML interface.

    The export and import operations can be of benefit when you upgrade to software levels which are not necessarily compatible with the data storage format of older versions of the software. To prevent loss of data, you export all data of document collections to XML files before upgrading the software. Then after upgrading the software level, you can use the previously exported files to return the document collection data back into the new software level.

    Pending Documents: Documents returned by a crawl of the selected document collection. Use to edit, accept, or reject documents.
    Category Tree: If you are using a rule based taxonomy, use this option to manage that taxonomy's categories and filter rules.

  3. View the following Collection Status information...

    Last update completed: Date when a site defined for the document collection was last updated by a scheduled update.
    Next update scheduled: Date when the next update of a site defined for the document collection is scheduled.
    # of active documents: Number of active documents in the document collection, that is, all documents that are available.
    # of deleted documents: Number of documents that have been marked for deletion.
    Collection Name and Location: Name and location of the selected document collection in the file system. This is the full path where all data and related information of the document collection is stored.
    Collection Language: Language for which the document collection and its index is optimized.

    The index uses this language to analyze the documents when indexing, if no other language is specified for the document. This feature enhances the quality of search results for users, as it allows them to use spelling variants, including plurals and inflections, for the search keyword.

    Categorizer used: Categorizer that is used by the document collection.
    Summarizer used: Shows whether a static summarizer is enabled for this document collection.

    To update the status information, click the refresh button of the browser. You can click the arrow icon to collapse or expand the Collection Status section.

 

Manage the sites of a document collection

In the Sites in Collection box, you can work with sites which belong to the document collection you selected from the Document Collections box. A document collection can be configured to cover more than one site. Sites in Collection allows you to do the following in relation to the document collection which you selected from the Document Collections list:

 

Document Search portlet

Portal users use the Document Search portlet to search documents and content. Before portal users can use the Document Search portlet, perform the following tasks for preparation:

  1. Prepare document collections
  2. Prepare the taxonomy and categories for documents if desired.
  3. Install the Document Search portlet for users.

 

Install the Document Search portlet for users

Once you have built the document collection and the associated index, you deploy the Document Search portlet. Users can then use the index to perform searches and browse the document collection. The search portlet WAR file is located in...

wp_root/install/SearchPortlets.war

After you have installed the search portlet, you can configure the index by going to...

Administration | Manage Portlets | Document Search | Edit

Set the IndexName parameter of the portlet to the name of your document collection. If you have more than one document collection built and maintained, create a new copy of the search portlet for each additional collection, and update its configuration parameters accordingly.

 

Use the Document Search portlet on anonymous pages

You might want to put the Document Search portlet on an anonymous page so that users can use it without having to log in to the portal. If you do this, you need to enable public sessions for your portal. The reason is that the document search portlet needs a valid session for its run time, and by default, sessions are not enabled on anonymous pages in the portal. By default, sessions are only created when a user authenticates and logs in to the portal server.

You can enable public sessions by editing the file...

wp_root/shared/app/config/services/NavigatorService.properties

and set public.session to true.

Restart both WebSphere Application Server and WebSphere Portal for your changes to take effect.
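If you prefer to script the change to the properties file, the edit could look like the following Python sketch. The helper set_property is hypothetical, it handles only simple key=value lines, and it assumes the file is encoded in ISO-8859-1 as is usual for Java properties files. Back up the file before changing it.

```python
from pathlib import Path


def set_property(path, key: str, value: str) -> None:
    """Set key to value in a simple key=value properties file, keeping other lines."""
    path = Path(path)
    lines = path.read_text(encoding="latin-1").splitlines()
    updated, found = [], False
    for line in lines:
        stripped = line.strip()
        is_comment = stripped.startswith(("#", "!"))
        name = stripped.split("=", 1)[0].strip() if "=" in stripped else ""
        if not is_comment and name == key:
            updated.append(f"{key} = {value}")
            found = True
        else:
            updated.append(line)
    if not found:
        # The key was not present, so append it at the end of the file.
        updated.append(f"{key} = {value}")
    path.write_text("\n".join(updated) + "\n", encoding="latin-1")
```

For example, set_property("NavigatorService.properties", "public.session", "true") rewrites an existing public.session line or appends one if none exists.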

 

Install the Taxonomy Manager portlet

To edit the predefined taxonomy, install and deploy the Taxonomy Manager portlet. Install...

wp_root/installableApps/TaxonomyEditor.war

After installation deploy the Taxonomy Editor portlet. For example, deploy the file on the Manage Search Index page. To deploy, use the Manage Portlets portlet on the Portlets page under Portal administration.

 

Additional features and options in the Manage Collections portlet

 

Manage the category tree for a document collection

If you associated a document collection with a user defined rule based categorizer at creation time, you can define its categories and create filter rules per category.

Rules determine which categories are associated with documents. They control which of the documents that are fetched from the sites enter the document collection, and to which categories they are assigned:

The categories that are defined per site are a subset of the entire category tree. The category tree is arranged in a hierarchy. The tree starts with the Root category. All other categories stem from the Root category.

If you do not have the option Add all documents to collection automatically enabled, you can always change the automated association created by the system between a document and a category. You perform this change from the Pending Documents panel, before the document is indexed and cataloged.

To manage the categories for a document collection associated with a rule based categorizer, proceed as follows:

  1. Select the desired document collection from the document collection list. This document collection needs to have a rule based categorizer.

  2. Click Category Tree next to the document collection list. Manage Collections displays the Manage Category Tree panel. It shows a box named Category Tree which shows a tree view of the categories, and a box named Manage categories which lets you manage the categories and rules for the taxonomy.

  3. Proceed with one of the tasks described in the following:

 

Manage categories

Manage categories for the selected document collection comprises the following tasks:

 

Create a new category

To add a new category to a document collection associated with a rule based categorizer, proceed as follows:

  1. Select the desired parent category under which you want to add a new category from the tree view.

  2. Enter the name for the new category in the entry field Sub-category name.

  3. Click Create. Manage Collections adds the new category to the taxonomy and displays it in the tree view.

 

Renaming a category

To rename a category in the taxonomy tree, proceed as follows:

  1. Select the desired category which you want to rename from the tree view.

  2. Enter the new name for the category in the entry field Current category.

  3. Click Rename. Manage Collections renames the category and displays the new name in the tree view.

 

Deleting a category

To delete a category from the taxonomy tree, proceed as follows:

  1. Select the desired category which you want to delete from the tree view.

  2. Click Delete. Manage Collections removes the category with all its subcategories from the taxonomy and the tree view.

 

Manage category rules

Rules are applied to documents when inserting them into a collection. There are two types of rules:

URL rule

A URL rule applies to the document's URL. It is expressed as a pseudo "regular expression". It describes a partial URL. All documents which have the rule text as a substring in their URL pass the rule.
Example: if the rule text is *hr*, then the URL...

http://myco.com/internal/hr/local/default.htm

...passes the rule.

http://myco.com/internal/finance/local/default.htm

...does not pass the rule.
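One way to picture the matching is the Python sketch below. The exact rule syntax is not specified here, so the sketch assumes that * stands for any run of characters and that all other rule text must appear literally somewhere in the URL; the function name url_rule_matches is hypothetical.

```python
import re


def url_rule_matches(rule: str, url: str) -> bool:
    """Return True if the URL contains the rule text, with * as a wildcard."""
    # Escape the literal parts and join them with ".*" for each asterisk.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    # re.search looks anywhere in the URL, which gives substring semantics.
    return re.search(pattern, url) is not None
```

Applied to the two URLs above with the rule text *hr*, the first URL matches because it contains /hr/, and the second does not.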

Content rule

A content rule is applied to the text of the document. It is expressed in the same format as a query. If the document is valid for this query, it passes the rule.

Examples: the rule hr "human resources" specifies documents that contain the term hr or the phrase human resources. The rule +hr -benefits specifies documents that contain the term hr but not the term benefits. This applies to words in their stemmed form. If you selected Unspecified for the language of the document collection, it applies to words in non-stemmed form.
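The two example rules can be mimicked with the following Python sketch. It is an approximation and not the portal's query engine: it assumes that + marks a required term, - marks an excluded term, unmarked terms are alternatives, and matching is case-insensitive on whole words. It performs no stemming, and the function name content_rule_matches is hypothetical.

```python
import re
import shlex


def _contains(text: str, term: str) -> bool:
    """Whole-word, case-insensitive search for a term or phrase."""
    return re.search(r"\b" + re.escape(term.lower()) + r"\b", text.lower()) is not None


def content_rule_matches(rule: str, text: str) -> bool:
    required, prohibited, optional = [], [], []
    # shlex.split keeps a quoted phrase such as "human resources" together.
    for token in shlex.split(rule):
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            prohibited.append(token[1:])
        else:
            optional.append(token)
    if any(_contains(text, term) for term in prohibited):
        return False
    if not all(_contains(text, term) for term in required):
        return False
    if optional and not required:
        # Without required terms, at least one alternative must be present.
        return any(_contains(text, term) for term in optional)
    return True
```

A real implementation would apply the collection's stemmer to both the rule terms and the document text before comparing them.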

You can perform the following tasks with Manage Category Rules:

 

Create a rule

To create a rule for a category, proceed as follows:

  1. Select the category for which you want to create a rule from the tree view.

  2. Click Create in the Manage Category Rules box. Manage Collections displays the Create Category rule box.

  3. In the entry field, type a name for the rule.

  4. Depending on the rule type you want to create, select URL rule or content rule.

  5. In the entry field type the details of the rule. For a content rule, type the strings to be applied. For a URL rule, type the partial URL string.

  6. Click Create to create the rule.

  7. The new rule is added to the list in the Rules box. It can then be selected and associated with a category. It is then used during crawling and indexing.

 

Associating a rule with a category

To associate a rule with a category, proceed as follows:

  1. Select the desired category with which you want to associate a rule from the tree view.

  2. Click Select in the Manage Category Rules box. Manage Collections displays the Select box for selecting rules.

  3. Select a rule from the rule list.

  4. Click Add to add the rule to the selected rule list.

  5. Select additional rules as desired.

  6. Click OK when done selecting rules. The Manage Collections portlet associates the selected rules with the category and returns to the Manage Category Tree panel.

 

Dissociating a rule from a category

To dissociate a rule from a category, proceed as follows:

  1. Select the desired category from which you want to dissociate the rule from the tree view.

  2. Select the rule which you want to dissociate from the category from the Manage Category Rules box.

  3. Click Delete. This rule is no longer associated with the category.

To add and delete rules from the system, click the Manage Rules button.

 

Browsing a Document Collection

To browse a document collection proceed as follows:

  1. Select the document collection which you want to browse.

  2. Click Browse Documents. The Browse Documents panel is displayed.

From this panel you can browse through the entire document collection. You can also remove documents and edit the metadata associated with documents as in the Pending Documents panel. If a collection is associated with a category tree, you can navigate the tree and see which documents are associated with each category.

 

Verify address

A new option Verify Address has been added to the Sites in Collection box. Use it to verify the URL address of a selected site.

Select the site which you want to verify and click Verify Address. If the Web site is available and not blocked by a Robot.txt file, Manage Collections returns the message Site is OK. If the site is invalid, inaccessible, or blocked, Manage Collections returns an error message.

When you create a new site, Manage Collections invokes the Verify Address feature.

 

Obey robot.txt file

A new option Obey Robot.txt has been added to the Add Site and Edit Site panels. If you select this option, the crawler observes the restrictions specified in the robot.txt file when accessing URLs for documents. Select this option by marking the Obey Robot.txt check box.
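What observing these restrictions means for an individual URL can be illustrated with Python's standard urllib.robotparser module. This is a generic sketch of robots exclusion checking, not the portal crawler's code; the helper allowed_by_robots, the sample rules, and the user agent name portal-crawler are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, user_agent: str, url: str) -> bool:
    """Check a URL against exclusion rules given as a list of text lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Sample exclusion rules: every crawler is kept out of /private/.
rules = ["User-agent: *", "Disallow: /private/"]
public_ok = allowed_by_robots(rules, "portal-crawler", "http://myco.com/public/default.htm")
private_ok = allowed_by_robots(rules, "portal-crawler", "http://myco.com/private/default.htm")
```

A crawler that obeys the file skips every URL for which the check returns False.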

 

Confirmation for Delete actions

A confirmation box has been added for all deletion tasks. For example, this applies to deleting collections and sites. Click OK to confirm the deletion of the selected item, or click Cancel if you do not want to delete the item.
