
Calculating last-modified time

The crawler will first attempt to parse the "last-modified" META header of an HTML page. Recognized time formats are:

Sat, 07 Apr 2001 00:58:08 GMT (HTTP standard time format)
Saturday, 08-Sep-2001 21:46:40 EDT (Apache server SSI format)

Generally, the format must be "day-monthname-year hour:minute:second". Leading and trailing data such as the weekday and the time zone are stripped. (Stripping the time zone is a bug, yes, sorry: the zone offset is not applied to the parsed time.)
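As an illustration of the first step, here is a minimal Python sketch of pulling the "last-modified" META value out of a page. It is not FDSE's own code: the class and function names are hypothetical, and matching the name attribute case-insensitively is an assumption.

```python
from html.parser import HTMLParser


class LastModifiedMetaFinder(HTMLParser):
    """Remembers the content of the first <meta name="last-modified"> tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta" or self.value is not None:
            return
        attr_map = {key: (val or "") for key, val in attrs}
        # Assumption: the attribute value is compared case-insensitively.
        if attr_map.get("name", "").strip().lower() == "last-modified":
            self.value = attr_map.get("content", "").strip() or None


def find_last_modified_meta(html_text):
    """Return the raw META time string, or None if the tag is absent."""
    finder = LastModifiedMetaFinder()
    finder.feed(html_text)
    finder.close()
    return finder.value
```

The returned string still has to be parsed into a time. A value of None corresponds to the "isn't present" case described next.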

If the META header isn't present or cannot be parsed, other methods will be used:

In general, web pages don't have "last-modified" META or HTTP headers, and so the last modified time will be equal to the index time. Files discovered via the file system will have accurate last modified information.
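The fallback behavior can be sketched roughly as follows. This is a hedged illustration, not FDSE's implementation: the function names are hypothetical, and the order shown (META value, then the HTTP Last-Modified header, then the index time) is an assumption based on the description above.

```python
import os
import time
from email.utils import parsedate_to_datetime


def choose_last_modified(meta_epoch, http_last_modified, index_time=None):
    """Pick a last-modified time in epoch seconds for a crawled web page.

    meta_epoch: already-parsed META time in epoch seconds, or None.
    http_last_modified: raw Last-Modified header string, or None.
    index_time: epoch seconds of the crawl; defaults to the current time.
    """
    if meta_epoch is not None:
        return meta_epoch
    if http_last_modified:
        try:
            return int(parsedate_to_datetime(http_last_modified).timestamp())
        except (TypeError, ValueError):
            pass  # Unparseable header: fall through to the index time.
    return int(index_time if index_time is not None else time.time())


def file_last_modified(path):
    """For files found via the file system, stat gives the real time."""
    return int(os.stat(path).st_mtime)
```

With neither a META value nor a usable header, the function returns the index time, which matches the common case for web pages described above.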

On Apache, if server-side includes (SSI) are enabled, you can make the last modified time available by placing the following tag in the <head> section of each page:

<meta name="last-modified" content="<!--#echo var="LAST_MODIFIED" -->" />

The exact regular expression for parsing string-based times is:


Tip: On the Apache web server, the last modified HTTP header is often returned only if the HTML or SHTML file is executable. If the file has only read permission, then no last modified information is returned. Setting all of your HTML documents to executable will allow FDSE to learn about their last modified times, and will also dramatically improve web server performance because it allows certain caching technologies to function. See this article for more information.

History: applies to FDSE version and newer. Prior to that version, the file system crawler would still get accurate last modified times using stat, but the web crawler would just use the current time as the last modified time.

In FDSE versions through .0053, the regular expression used was:

(\d+)( |-)(\w\w\w)( |-)(\d+) (\d+)\:(\d+)\:(\d+)

The new regex is improved in that it allows for multiple spaces and allows for a missing "seconds" field.
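To see how the older expression behaves, here is a small Python sketch that applies it to the two recognized formats and to the two cases it cannot handle. The pattern is the one quoted above. The helper name, the month table, and the choice to treat the stripped time zone as UTC are assumptions made for illustration only.

```python
import calendar
import re

# The expression quoted above for FDSE versions through .0053, unchanged.
OLD_PATTERN = re.compile(r"(\d+)( |-)(\w\w\w)( |-)(\d+) (\d+)\:(\d+)\:(\d+)")

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_with_old_pattern(text):
    """Return epoch seconds, or None if the old pattern does not match.

    Assumption: the time zone is ignored and the time is read as UTC.
    """
    match = OLD_PATTERN.search(text)
    if match is None:
        return None
    day, _, month_name, _, year, hour, minute, second = match.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        return None
    fields = (int(year), month, int(day), int(hour), int(minute), int(second))
    return calendar.timegm(fields + (0, 0, 0))


print(parse_with_old_pattern("Sat, 07 Apr 2001 00:58:08 GMT"))       # 986605088
print(parse_with_old_pattern("Saturday, 08-Sep-2001 21:46:40 EDT"))  # 999985600
print(parse_with_old_pattern("07 Apr 2001  00:58:08"))               # None (two spaces)
print(parse_with_old_pattern("07 Apr 2001 00:58"))                   # None (no seconds)
```

Under the UTC assumption, the EDT example comes out four hours earlier than the real instant (1000000000), which shows the effect of discarding the time zone.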