USGenNet's Search Information for Webmasters

I. How to Get the Most out of USGenNet's Search Engine
(also applies to all Internet web-crawlers)

USGenNet's Search Engine is built upon HtDig, a popular and flexible collection of WWW search engine software. This software is designed to function in a manner similar to common Internet-based search engines. This includes the use of a web-crawler for retrieving and indexing the content of web pages.

What is a web-crawler?

Before your site can be searched, it must first be "crawled" (also known as "spidered" or "indexed"). Only after your site has been crawled for words and hyper-links, and that information indexed, is it possible for the site to be searched. One important caveat: when a web-crawler visits your site, it ignores web pages that cannot be reached by following links from other web pages. Often the crawl begins at the site's default page.

What are default pages?

Default pages are the "front door" to both your web account and all its sub-directories. In general, a well-designed site will be laid out in a manner allowing a user starting at the default page to reach any other page on the site simply by clicking on links. A web-crawler assumes this layout and will miss pages that do not satisfy this expectation. Put simply, if you cannot reach a page on your site by following links from your default page, it is likely that a web-crawler will never see that page. Default pages are defined by using special file names that the web server is configured to recognize. Although index.html is likely the most popular choice at this time, the USGenNet web server is configured to recognize all of the following as default pages:

index.htm/index.html
default.htm/default.html
home.htm/home.html
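
As a quick illustration, the check below tests whether a directory listing contains one of the default page names listed above. It is only a sketch in Python: the function name and the sample file lists are hypothetical, and it assumes names must match exactly, including case, which the text does not confirm.

```python
# File names the text above lists as recognized default pages.
DEFAULT_PAGES = {
    "index.htm", "index.html",
    "default.htm", "default.html",
    "home.htm", "home.html",
}

def find_default_pages(filenames):
    """Return the file names in a directory that count as default pages."""
    # Assumption: names must match exactly, including upper/lower case.
    return [name for name in filenames if name in DEFAULT_PAGES]

# Hypothetical directory listings.
print(find_default_pages(["welcome.html", "photos.html"]))  # no default page found
print(find_default_pages(["home.htm", "photos.html"]))      # home.htm is recognized
```

A directory for which this returns an empty list has no "front door," so a crawler arriving at that directory may have nowhere to start.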

How do web crawlers work?

Once a web-crawler enters your site (through the front door of the default page), it searches your site by following links. If the file containing your home page is not named in a manner that the web server recognizes as a default page, or if you have files that are not linked from any other page on your web site, the result will be an incomplete crawl. This in turn means that some or all of the content of your site will not be available during a search.

In other words, if your home page has a default name, the web-crawler begins there, first reading that entire page, and then indexing the content. It also builds a list of all the links it finds on that page. It next follows those links and indexes the corresponding pages, adding any new links it finds to a running list. Once the list of links is exhausted, the crawler assumes it has gathered everything of interest.
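
The crawl described above can be sketched as a walk over a running list of links. The Python below is a simplified illustration, not the actual HtDig code; the SITE dictionary and its page names are hypothetical.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "index.html": ["cemeteries.html", "census.html"],
    "cemeteries.html": ["index.html", "oakhill.html"],
    "census.html": [],
    "oakhill.html": [],
    "orphan.html": ["index.html"],  # no other page links TO this one
}

def crawl(site, start="index.html"):
    """Return the pages reachable from the default page by following links."""
    seen = {start}
    queue = deque([start])   # the running list of links still to visit
    indexed = []
    while queue:
        page = queue.popleft()
        indexed.append(page)                 # read and "index" the page
        for link in site.get(page, []):
            if link not in seen:             # add only links not seen before
                seen.add(link)
                queue.append(link)
    return indexed

indexed = crawl(SITE)
missed = sorted(set(SITE) - set(indexed))
print("Indexed:", indexed)
print("Missed:", missed)
```

In this sketch orphan.html is never indexed, because no page reachable from index.html links to it. Adding a link to it from any indexed page would fix that.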

Note: The USGenNet Search Engine only follows links that are on the USGenNet server (all domains). USGenNet also hosts the TNGen Search Engine, which searches all TNGenWeb Project sites, regardless of server.

Reminder: If you upload new or edited files to your site, they will not be searchable until the next time the USGenNet Search Engine crawls the server. In the future, USGenNet will announce each time the server or specific domains have been crawled.

II. How to Create Custom Searches for your USGenNet Web Site

Note: Most of the following information, examples, and code reference the USGenNet.Org domain but apply to all domains on USGenNet's server. USGenNet Webmasters needing assistance with search engine codes should subscribe to the Web-Help Mailing List.

To create a simple search engine
for your USGenNet.Org county web site:

1. Copy/Paste the following code into your page:

<P>
<FORM METHOD="post" ACTION="http://www.usgennet.org/cgi-bin/htsearch">
<FONT SIZE="-1">
Match:
<SELECT NAME="method">
<OPTION VALUE="and">All
<OPTION VALUE="or">Any
<OPTION VALUE="boolean">Boolean
</SELECT>
Format:
<SELECT NAME="format">
<OPTION VALUE="builtin-long">Long
<OPTION VALUE="builtin-short">Short
</SELECT>
Sort by:
<SELECT NAME="sort">
<OPTION VALUE="score">Score
<OPTION VALUE="time">Time
<OPTION VALUE="title">Title
<OPTION VALUE="revscore">Reverse Score
<OPTION VALUE="revtime">Reverse Time
<OPTION VALUE="revtitle">Reverse Title
</SELECT>
</FONT>
<INPUT TYPE="hidden" NAME="config" VALUE="USGenNet">
<INPUT TYPE="hidden" NAME="restrict" VALUE="/yourstate/county/yourcounty/">
<INPUT TYPE="hidden" NAME="exclude" VALUE="">
<P>
Search:
<INPUT TYPE="text" SIZE="30" NAME="words" VALUE="">
<INPUT TYPE="submit" VALUE="Search">
</FORM>
<P>

2. Once you have added the above search engine code, edit the "restrict" line: replace yourstate with the applicable 2-character state designation, and replace yourcounty with the name of your county site. For example, the "restrict" line code for Peoria Co, ILGenWeb is:

<INPUT TYPE="hidden" NAME="restrict" VALUE="/il/county/peoria/">
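
If you maintain several county sites, the substitution in step 2 can be scripted. The following Python sketch simply fills in the pattern shown above; the function names are hypothetical, and lower-casing the state and county is an assumption based on the examples in this document.

```python
def county_restrict(state, county):
    """Build the "restrict" value for a USGenNet.Org county site."""
    # Assumption: directory names are lower case, as in the examples.
    return f"/{state.lower()}/county/{county.lower()}/"

def restrict_input(value):
    """Wrap a restrict value in the hidden form field used by the search form."""
    return f'<INPUT TYPE="hidden" NAME="restrict" VALUE="{value}">'

print(restrict_input(county_restrict("IL", "Peoria")))
```

For Peoria County, Illinois this prints the same "restrict" line shown in the example above.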

To create a simple search engine
for your USGenNet.Org state web site:

Copy/Paste the above search engine code, but change the "restrict" line to:

<INPUT TYPE="hidden" NAME="restrict" VALUE="/yourstate/state/"> 
               [or /state1/ or /state2/ etc.]

Note: The "restrict" line will always need to be edited, and sometimes the "exclude" line may also need editing (see below). Entries listed in "restrict" are the directories or sub-directories you wish to include in your search, whereas entries under "exclude" are those you wish to leave out.

To create more complex search engines
for USGenNet.Org web sites:

USGenNet webmasters can create almost any combination of custom search engines for their web sites using the "restrict" and "exclude" lines. For example, in addition to a county-wide search, a webmaster can include special searches for a marriages sub-directory or Revolutionary War sub-directories, etc. Perry County, Mississippi's Search Perry! site, for instance, uses the following "restrict" code to create several special searches in addition to its county-wide search:

<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/cemeteries/">Cemeteries
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/census/">Census
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/civilwar/">Civil War
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/school">Schools
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/spanishamericanwar/">Spanish American War
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/tax">Tax Records
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/wpa/">WPA Transcriptions
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/ww1/">World War I

In another example, the Tuscola County, MI website on USGenNet has created a new Search Tuscola! site that includes county-wide and township search engines, a special search of an online book, and separate death and marriage searches. The Tuscola site also includes a Wayne County, MI book (/tuscola/det/) and Wayne County marriages (/tuscola/waymar/). By using the "exclude" feature, the Tuscola webmaster was able to exclude the Wayne records from the Tuscola searches, and the Wayne County, MI webmaster was able to add those marriages and the book to a Search Wayne! site.

The Tuscola "exclude" line:

<INPUT TYPE="hidden" NAME="exclude" VALUE="|/mi/county/tuscola/waymar/|/mi/county/tuscola/det/|">

The Wayne "restrict" (include) line:

<INPUT TYPE="hidden" NAME="restrict" VALUE="|/mi/county/wayne/|/mi/county/tuscola/det/|/mi/county/tuscola/waymar/|">

Note: The | mark is a separator used when searching more than one directory/sub-directory.
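
To see how the Tuscola and Wayne lines interact, the Python sketch below models "restrict" and "exclude" as lists of plain text patterns separated by the | mark. This is an assumption drawn from the examples in this document, not a description of the search engine's internals. The page URLs and the Tuscola "restrict" value are hypothetical.

```python
def split_patterns(value):
    """Split a |-separated "restrict" or "exclude" value into its patterns."""
    return [p for p in value.split("|") if p]

def is_searched(url, restrict="", exclude=""):
    """A URL is searched if it contains some restrict pattern (when any are
    given) and contains no exclude pattern. Plain substring matching is assumed."""
    includes = split_patterns(restrict)
    excludes = split_patterns(exclude)
    if includes and not any(p in url for p in includes):
        return False
    return not any(p in url for p in excludes)

TUSCOLA_RESTRICT = "/mi/county/tuscola/"   # assumed; not shown in the text
TUSCOLA_EXCLUDE = "|/mi/county/tuscola/waymar/|/mi/county/tuscola/det/|"
WAYNE_RESTRICT = "|/mi/county/wayne/|/mi/county/tuscola/det/|/mi/county/tuscola/waymar/|"

# Hypothetical page URLs.
deaths = "http://www.usgennet.org/mi/county/tuscola/deaths/a.html"
waymar = "http://www.usgennet.org/mi/county/tuscola/waymar/a.html"

print(is_searched(deaths, TUSCOLA_RESTRICT, TUSCOLA_EXCLUDE))  # Tuscola search, Tuscola deaths
print(is_searched(waymar, TUSCOLA_RESTRICT, TUSCOLA_EXCLUDE))  # Tuscola search, Wayne marriages
print(is_searched(waymar, WAYNE_RESTRICT))                     # Wayne search, Wayne marriages
```

Under these assumptions the Wayne marriages drop out of the Tuscola search and appear in the Wayne search, which is the behavior described above.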

In yet another example, the Sullivan County, TN webmasters on USGenNet have created a special Search Sullivan! site that searches Sullivan Co, TNGenWeb, Sullivan Co, TN ALHN, the Combs-Coombs &c. Families of Sullivan Co, TN and a TNGenWeb Special Project that includes Sullivan County records:

<INPUT TYPE="hidden" NAME="restrict" VALUE="|www.tngenweb.org/sullivan/|/records/tn-sull|/tnland/squabble">

Note: "Restrictions" need only include enough information to ensure the correct directories are being searched. Because numerous counties in the U.S. are named Sullivan, it was necessary to add the domain name to this "restrict" line for Sullivan Co, TNGenWeb, but only necessary to include the sub-directory structure (/tnland/squabble/) for the TNGenWeb Special Project with Sullivan records.

If multiple webmasters in adjacent counties or different projects wish to "join forces," they can create individual search engines for their individual county sites and also create a special search that covers multiple county sites. For example, Peoria Co, ILGenWeb's Search Peoria! site includes both the Peoria County, ILGenWeb site and the USGenWeb Census Project's Peoria County, Illinois transcriptions. Likewise, State webmasters can create special searches for regions within their state or all Civil War records, etc. (See Special and Topical Sites below.)

IMPORTANT: If you create a search engine for your site that includes other sites on USGenNet, you must include a reciprocal link to the other site(s) included in your searches.

III. Special and Topical Sites

The above search code examples show how the use of sub-directories increases the possible uses of search engine code. Webmasters with special or topical sites can also use "standardized" naming patterns for sub-directories in order to be included in USGenNet's server-wide Special and Topical Searches. For example, some common directory and sub-directory names used by USGenNet county webmasters are:

afro-amer
bible
bios
births
cemetery or cemeteries
census
children
civilwar and cw
deaths
deeds
folklore
land
letters
marriages
migrations
military
native-amer
obits
records
revwar and rw
surnames
tax
trails
war
wills
wpa

Use of the above names for directories will automatically result in inclusion in USGenNet's Special and Topical Searches.

Also note that judicious use of the / mark is helpful. For example, to create a search code for all cemetery sites, instead of entering /cemetery/ and /cemeteries/ in your "restrict" line, you can enter /cemeter (leaving off the last / mark) in order to find all cemetery and cemeteries sub-directories. This is also advisable for marriage versus marriages, migration versus migrations, etc. For example, <INPUT TYPE="hidden" NAME="restrict" VALUE="|/cemeter|"> will search all sub-directories that begin with cemeter, whereas <INPUT TYPE="hidden" NAME="restrict" VALUE="|/cemeter/|"> will only search sub-directories with the exact spelling, "cemeter".
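
The effect of the trailing / mark can be shown with a small Python sketch. It assumes a pattern matches wherever it appears as plain text in a page's path, which is how the examples in this document behave; the sample paths are hypothetical.

```python
def matching_paths(pattern, paths):
    """Return the paths that contain the pattern as plain text."""
    return [p for p in paths if pattern in p]

# Hypothetical page paths.
PATHS = [
    "/ms/county/perry/cemeteries/oakhill.html",
    "/il/county/peoria/cemetery/index.html",
    "/mi/county/tuscola/census/1880.html",
]

print(matching_paths("/cemeter", PATHS))    # both cemetery and cemeteries
print(matching_paths("/cemetery/", PATHS))  # only the exact name "cemetery"
print(matching_paths("/cemeter/", PATHS))   # nothing: no directory is named "cemeter"
```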

IV. Miscellany

Although there are never any guarantees when it comes to Internet web crawlers, "technically" you can exclude a specified page from being searched by adding a META tag to the <HEAD> of the file. Example:

<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
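
To check whether a page actually carries this tag, for instance after editing a template, a short script can read the directives back. The Python sketch below uses the standard html.parser module; the class name, function name, and sample page are hypothetical, and it shows only how the tag can be read, not how any particular crawler treats it.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <META name="ROBOTS"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())

def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Hypothetical page using the META tag shown above.
page = ('<HTML><HEAD><META name="ROBOTS" content="NOINDEX, NOFOLLOW">'
        '</HEAD><BODY>Private notes</BODY></HTML>')
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)
```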

USGenNet's in-house HTDIG search engine responds to the above META tag and also permits exclusion of specific text within a page, likewise by the use of HTML code:

<!--htdig_noindex-->[Text you don't want indexed]<!--/htdig_noindex-->

Use of this last feature can be particularly helpful if you wish to exclude "standard" language on each page (such as copyright language or the name of the webmaster, etc.) from being searched.