
Crawl an external site using a seedlist provider


Overview

The seedlist HTTP crawler can crawl external sites that publish their content in an Atom/XML-based seedlist format. This format lets a site publish only the content that has changed between crawling sessions, which makes crawling more efficient. We configure the seedlist crawler with general parameters, filters, and schedulers, then run the crawler.

Before configuring the seedlist crawler, collect the following information:

  • Root URL of the seedlist page.

    The seedlist page contains...

    • Metadata that directs the crawler to the actual links to be fetched and indexed.
    • Document level metadata stored along with the document in the search index.

  • User ID and password for the seedlist page.
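
To illustrate how a seedlist directs a crawler to documents and supports incremental crawling, here is a minimal Python sketch. It is not the crawler's actual implementation. It assumes a plain Atom feed in which each entry carries a link and an updated timestamp; the sample feed, the URLs, and the function names (parse_timestamp, entries_updated_since) are hypothetical, and a real seedlist provider may add its own elements and namespaces.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Standard Atom namespace in ElementTree's "{uri}tag" notation.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample seedlist; a real provider's feed will differ.
SAMPLE_SEEDLIST = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example seedlist</title>
  <updated>2024-03-02T10:00:00Z</updated>
  <entry>
    <id>doc-1</id>
    <title>First document</title>
    <link href="http://example.com/docs/1"/>
    <updated>2024-03-01T09:00:00Z</updated>
  </entry>
  <entry>
    <id>doc-2</id>
    <title>Second document</title>
    <link href="http://example.com/docs/2"/>
    <updated>2024-03-02T10:00:00Z</updated>
  </entry>
</feed>
"""


def parse_timestamp(text):
    """Parse an Atom timestamp, mapping a trailing 'Z' to an explicit UTC offset."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def entries_updated_since(seedlist_xml, last_crawl):
    """Return the entries whose <updated> time is later than last_crawl."""
    root = ET.fromstring(seedlist_xml)
    results = []
    for entry in root.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        updated = entry.findtext(ATOM + "updated")
        if link is None or updated is None:
            continue  # skip entries that lack a link or a timestamp
        if parse_timestamp(updated) > last_crawl:
            results.append({
                "url": link.get("href"),
                "title": entry.findtext(ATOM + "title", default=""),
                "updated": updated,
            })
    return results


if __name__ == "__main__":
    # Assumed time of the previous crawling session.
    last_crawl = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
    for doc in entries_updated_since(SAMPLE_SEEDLIST, last_crawl):
        print(doc["url"], doc["updated"])
```

With the sample feed and the assumed last crawl time, only the second document is selected, because the first was last updated before the previous session. The title stands in for the document-level metadata that would be stored with the document in the search index.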


Configure the seedlist crawler

  1. Go to...

      Manage Search | Search Services | Portal Search Service | search collection | New Content Source | Content source type (drop-down) | Seedlist provider

  2. On the General Parameters, Advanced parameters, Schedulers, and Security tabs, fill in the fields and select options as required.

  3. Click Create.

    This creates the new content source.

  4. To run the crawler, click the start crawler icon (right-pointing arrow) next to the content source name on the Content Sources page.

    If we have defined a crawler schedule under the Schedulers tab, the crawler starts automatically at the next scheduled time.

