Limiting the crawler to n levels or directories
A common feature request is the ability to limit the crawler to a certain number of levels, or directories, when crawling a site. For example, a crawler limited to 2 levels would index "http://xav.com/1/2.html" but not "http://xav.com/1/2/3.html".
This can be implemented with Filter Rules. Go to "Admin Page" => "Filter Rules" and choose "Create New Rule". Use the following parameters:
- Name: "Limited Levels"
- Enabled: checked
- Action: Deny
- Analyze: URL
- Apply rule *unless* analyzed text contains at least *1* of
- Substring: ^http://[^/]+(/[^/]*){0,1}$
The second number in the `{0,n}` quantifier is the maximum depth allowed. Examples:
| Depth | Substring | Examples |
|---|---|---|
| First level | `^http://[^/]+(/[^/]*){0,1}$` | http://xav.com/, http://xav.com/contact.html |
| Second level | `^http://[^/]+(/[^/]*){0,2}$` | http://xav.com/, http://xav.com/contact.html, http://xav.com/scripts/ |
| Third level | `^http://[^/]+(/[^/]*){0,3}$` | http://xav.com/, http://xav.com/scripts/, http://xav.com/scripts/search/index.html |
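Before saving the rule, you can sanity-check the pattern outside the crawler. The short Python sketch below (not part of FDSE itself; the URL list is illustrative) applies the second-level pattern to a few sample URLs and reports which would pass the filter:

```python
import re

# Second-level pattern: {0,2} allows at most two path segments.
# Raise the second number in the quantifier to allow deeper URLs.
pattern = re.compile(r"^http://[^/]+(/[^/]*){0,2}$")

urls = [
    "http://xav.com/",                           # depth 1 -> allowed
    "http://xav.com/contact.html",               # depth 1 -> allowed
    "http://xav.com/scripts/",                   # depth 2 -> allowed
    "http://xav.com/scripts/search/index.html",  # depth 3 -> denied
]

for url in urls:
    allowed = pattern.match(url) is not None
    print(f"{url}: {'crawl' if allowed else 'skip'}")
```

Note that a trailing slash counts as a segment with an empty name, so "http://xav.com/scripts/" matches the second-level pattern but not the first-level one.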
See also: Filter Rules: Creating a new Filter Rule
"Limiting the crawler to n levels or directories"
http://www.xav.com/scripts/search/help/1031.html