I. How to Get the Most out of USGenNet's Search Engine
(also applies to all Internet web-crawlers)
USGenNet's Search Engine is built upon HtDig, a popular and
flexible collection of WWW search engine software. This software is
designed to function in a manner similar to common Internet-based
search engines. This includes the use of a web-crawler for retrieving
and indexing the content of web pages.
What is a web-crawler?
Before your site can be searched, it must first be "crawled" (also known as
"spidered" or "indexed"). Only after your site has been crawled for
words and hyper-links, and that information indexed, is it possible for
the site to be searched. One important caveat: when a web-crawler visits
your site, it ignores web pages that cannot be reached by following links
specified in other web pages. Often the crawl begins at the site's default
page.
What are default pages?
Default pages are the "front door" to both your web account and all its
sub-directories. In general, a well designed site will be laid out in
a manner allowing a user starting at the default page to reach any other
page on the site simply by clicking on links. A web-crawler assumes this
layout and will miss pages that do not satisfy this expectation. Put
simply, if you cannot reach a page on your site by following links from
your default page, it is likely that a web-crawler will never see that
page.
Default pages are defined by using special file names that the web
server is configured to recognize. Although index.html is likely the
most popular choice at this time, the USGenNet web server is configured
to recognize all of the following as default pages:
index.htm/index.html
default.htm/default.html
home.htm/home.html
How do web crawlers work?
Once a web-crawler enters your site (through the front door of the default
page), it searches your site by following links. If the file containing your
home page is not named in a manner that the web server recognizes as a default
page, or if you have files that are not linked from any other page on your web
site, the result will be an incomplete crawl. This in turn means that some
or all of the content of your site will not be available during a search.
In other words, if your home page has a default name, the web-crawler begins
there, first reading that entire page, and then indexing the content. It also
builds a list of all the links it finds on that page. It next follows those
links and indexes the corresponding pages, adding any new links it finds to
a running list. Once the list of links is exhausted, the crawler assumes it
has gathered everything of interest.
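As an illustration of a layout that a crawler can follow completely, here is a minimal sketch of a default page. The file name index.html is one of the recognized default names listed above; the page title and the three sub-directory names are hypothetical and stand in for whatever sections your own site has.

```html
<!-- index.html: hypothetical default page for a county site -->
<HTML>
<HEAD>
<TITLE>Example County Home Page</TITLE>
</HEAD>
<BODY>
<H1>Example County</H1>
<!-- Every section is linked from the default page, so a crawler
     that starts here can reach each one by following links. -->
<UL>
<LI><A HREF="cemeteries/index.html">Cemeteries</A>
<LI><A HREF="census/index.html">Census</A>
<LI><A HREF="marriages/index.html">Marriages</A>
</UL>
</BODY>
</HTML>
```

Each linked page would in turn link to the files beneath it. A file that no page links to stays invisible to the crawler, however it is named.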
Note: The USGenNet Search Engine only follows links that are on the USGenNet
server (all domains). USGenNet also hosts the TNGen Search Engine, which
searches all TNGenWeb Project sites, regardless of server.
Reminder: If you upload new or edited files to your site, they will not
be searchable until the next time the USGenNet Search Engine crawls
the server. In the future, USGenNet will announce each time the server or
specific domains have been crawled.
II. How to Create Custom Searches for your USGenNet Web Site
Note: Most of the following information, examples and code reference the
USGenNet.Org domain, but applies to all domains on USGenNet's server. USGenNet
Webmasters needing assistance with search engine codes should subscribe to the
Web-Help Mailing List.
To create a simple search engine for your USGenNet.Org county web site:
1. Copy/Paste the following code into your page:
<P>
<FORM METHOD="post" ACTION="http://www.usgennet.org/cgi-bin/htsearch">
<FONT SIZE="-1">
Match:
<SELECT NAME="method">
<OPTION VALUE="and">All
<OPTION VALUE="or">Any
<OPTION VALUE="boolean">Boolean
</SELECT>
Format:
<SELECT NAME="format">
<OPTION VALUE="builtin-long">Long
<OPTION VALUE="builtin-short">Short
</SELECT>
Sort by:
<SELECT NAME="sort">
<OPTION VALUE="score">Score
<OPTION VALUE="time">Time
<OPTION VALUE="title">Title
<OPTION VALUE="revscore">Reverse Score
<OPTION VALUE="revtime">Reverse Time
<OPTION VALUE="revtitle">Reverse Title
</SELECT>
</FONT>
<INPUT TYPE="hidden" NAME="config" VALUE="USGenNet">
<INPUT TYPE="hidden" NAME="restrict" VALUE="/yourstate/county/yourcounty/">
<INPUT TYPE="hidden" NAME="exclude" VALUE="">
<P>
Search:
<INPUT TYPE="text" SIZE="30" NAME="words" VALUE="">
<INPUT TYPE="submit" VALUE="Search">
</FORM>
<P>
2. Once you have added the above search engine code, edit the "restrict" line: change yourstate to the
applicable 2-character state designation, and change yourcounty to the name of your county site. For
example, the "restrict" line code for Peoria Co, ILGenWeb is:
<INPUT TYPE="hidden" NAME="restrict" VALUE="/il/county/peoria/">
To create a simple search engine for your USGenNet.Org state web site:
Copy/Paste the above search engine code, but change the "restrict" line to:
<INPUT TYPE="hidden" NAME="restrict" VALUE="/yourstate/state/">
[or /state1/ or /state2/ etc.]
Note: The "restrict" line will always need to be edited, and sometimes the "exclude" line may also need
editing (see below). Entries listed in "restrict" are the directories or sub-directories you wish to include
in your search, whereas entries listed in "exclude" are those you wish to leave out.
To create more complex search engines for USGenNet.Org web sites:
USGenNet webmasters can create almost any combination of custom search engines for
their web sites using the "restrict" and "exclude" lines. In addition to a
county-wide search, a webmaster can include special searches for a marriages
sub-directory, revolutionary war sub-directories, etc. For example, Perry County,
Mississippi's Search Perry! uses the following "restrict"
code to create several special searches in addition to its county-wide search:
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/cemeteries/">Cemeteries
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/census/">Census
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/civilwar/">Civil War
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/school">Schools
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/spanishamericanwar/">Spanish American War
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/tax">Tax Records
<BR><INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/wpa/">WPA Transcriptions
<INPUT TYPE="radio" NAME="restrict" value="/ms/county/perry/ww1/">World War I
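Radio buttons like these only take effect inside a form that is submitted to htsearch. The following is a minimal sketch of such a form, not the actual Search Perry! page: it reuses the ACTION and "config" values from the county form above, shows only two of the special searches, and adds a hypothetical county-wide choice that is selected by default.

```html
<FORM METHOD="post" ACTION="http://www.usgennet.org/cgi-bin/htsearch">
<INPUT TYPE="hidden" NAME="config" VALUE="USGenNet">
<INPUT TYPE="hidden" NAME="method" VALUE="and">
<!-- Hypothetical county-wide choice, checked by default -->
<INPUT TYPE="radio" NAME="restrict" VALUE="/ms/county/perry/" CHECKED>Entire county
<!-- Special searches: each radio button narrows "restrict" to one sub-directory -->
<BR><INPUT TYPE="radio" NAME="restrict" VALUE="/ms/county/perry/cemeteries/">Cemeteries
<BR><INPUT TYPE="radio" NAME="restrict" VALUE="/ms/county/perry/census/">Census
<P>
Search:
<INPUT TYPE="text" SIZE="30" NAME="words" VALUE="">
<INPUT TYPE="submit" VALUE="Search">
</FORM>
```

Because all the radio buttons share the name "restrict", only the selected value is submitted, so each search covers exactly one of the listed directories.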
In another example, the Tuscola County, MI website on USGenNet has created a
new Search Tuscola! site that includes county-wide and township searches, a special
search of an online book, and one search each for deaths and marriages. The Tuscola
site also includes a Wayne County, MI book (/tuscola/det/) and Wayne County marriages
(/tuscola/waymar/). By using the "exclude" feature, the Tuscola webmaster was able to
exclude the Wayne records from the Tuscola searches, and the Wayne County, MI webmaster
was able to add those marriages and the book to a Search Wayne! site.
The Tuscola "exclude" line:
<INPUT TYPE="hidden" NAME="exclude" VALUE="|/mi/county/tuscola/waymar/|/mi/county/tuscola/det/|">
The Wayne "restrict" (include) line:
<INPUT TYPE="hidden" NAME="restrict" VALUE="|/mi/county/wayne/|/mi/county/tuscola/det/|/mi/county/tuscola/waymar/|">
Note: The | mark is a separator used when searching more than one directory/sub-directory.
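In a form, the "restrict" and "exclude" fields work as a pair: "restrict" names what to search, and "exclude" removes parts of it. A minimal sketch of how the two hidden fields might sit together in the Tuscola county-wide form follows. The "exclude" value is the one quoted above; the county-wide "restrict" value is an assumption based on the directory pattern used elsewhere on this page and is not quoted from the Tuscola site.

```html
<!-- Assumed county-wide "restrict" value (not quoted from the Tuscola site) -->
<INPUT TYPE="hidden" NAME="restrict" VALUE="/mi/county/tuscola/">
<!-- "exclude" value as given above: leaves out the two Wayne County directories -->
<INPUT TYPE="hidden" NAME="exclude" VALUE="|/mi/county/tuscola/waymar/|/mi/county/tuscola/det/|">
```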
In yet another example, the Sullivan County, TN webmasters on USGenNet have created a special
Search Sullivan! site that searches Sullivan Co, TNGenWeb, Sullivan Co, TN ALHN,
the Combs-Coombs &c. Families of Sullivan Co, TN and a TNGenWeb Special Project that includes Sullivan County
records:
<INPUT TYPE="hidden" NAME="restrict" VALUE="|www.tngenweb.org/sullivan/|/records/tn-sull|/tnland/squabble">
Note: "Restrictions" need only include enough information to ensure the correct directories are being
searched. Because numerous counties in the U.S. are named Sullivan, it was necessary to add the domain name to
this "restrict" line for Sullivan Co, TNGenWeb, but only necessary to include the sub-directory structure
(/tnland/squabble/) for the TNGenWeb Special Project with Sullivan records.
If multiple webmasters in adjacent counties, or in different projects,
wish to "join forces," they can create individual search engines for their own
county sites and also create a special search that covers multiple county sites.
For example, Peoria Co, ILGenWeb's Search Peoria! site includes both the
Peoria County, ILGenWeb site and the USGenWeb Census Project's Peoria County,
Illinois transcriptions. Likewise, State webmasters can create special searches for
regions within their state, for all Civil War records, etc. (See Special and Topical Sites below.)
IMPORTANT: If you create a search engine for your site that includes other sites
on USGenNet, you must include a reciprocal link to the other site(s) included in your searches.
III. Special and Topical Sites
The above search code examples show how the use of sub-directories increases the possible
uses of search engine code. Webmasters with special or topical sites can also use
"standardized" naming patterns for sub-directories in order to be included in
USGenNet's server-wide Special and Topical Searches. For
example, some common directory and sub-directory names used by USGenNet county webmasters are:
afro-amer
bible
bios
births
cemetery or cemeteries
census
children
civilwar and cw
deaths
deeds
folklore
land
letters
marriages
migrations
military
native-amer
obits
records
revwar and rw
surnames
tax
trails
war
wills
wpa
Use of the above names for directories will automatically result in inclusion in USGenNet's Special and Topical
Searches.
Also note that judicious use of the / mark is helpful. For example, to create a search code for all cemetery sites,
instead of entering both /cemetery/ and /cemeteries/ in your "restrict" line, you can enter
/cemeter (leaving off the last / mark) in order to find all cemetery and cemeteries sub-directories.
This is also advisable for marriage versus marriages, migration versus migrations, etc. For example,
<INPUT TYPE="hidden" NAME="restrict" VALUE="|/cemeter|"> will search all sub-directories that begin
with cemeter, whereas <INPUT TYPE="hidden" NAME="restrict" VALUE="|/cemeter/|"> will only search
sub-directories with the exact spelling, "cemeter".
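To show the shortened patterns in context, here is a minimal sketch of a topical search form. It assumes the same ACTION and "config" values as the county form in Section II; the three topics and their labels are illustrative choices taken from the directory names listed above.

```html
<FORM METHOD="post" ACTION="http://www.usgennet.org/cgi-bin/htsearch">
<INPUT TYPE="hidden" NAME="config" VALUE="USGenNet">
<!-- No trailing / mark, so singular and plural directory names both match -->
<INPUT TYPE="radio" NAME="restrict" VALUE="/cemeter" CHECKED>Cemeteries
<BR><INPUT TYPE="radio" NAME="restrict" VALUE="/marriage">Marriages
<BR><INPUT TYPE="radio" NAME="restrict" VALUE="/migration">Migrations
<P>
Search:
<INPUT TYPE="text" SIZE="30" NAME="words" VALUE="">
<INPUT TYPE="submit" VALUE="Search">
</FORM>
```

Since none of these patterns names a state or county, a search of this kind is not limited to one site. To keep it within a single site, add the state and county path in front of the pattern.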
IV. Miscellany
Although there are never any guarantees when it comes to Internet web crawlers, "technically" you can exclude a
specified page from being searched by adding a META tag to the <HEAD> of the file. Example:
<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
USGenNet's in-house HtDig search engine responds to the above META tag and also permits exclusion of specific
text within a page, likewise by the use of HTML code:
<!--htdig_noindex-->
[Text you don't want indexed]
<!--/htdig_noindex-->
Use of this last feature can be particularly helpful if you wish to exclude "standard" language on each page
(such as copyright language or the name of the webmaster, etc.) from being searched.
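As a minimal sketch of where the markers go, the page below wraps only its footer in the htdig_noindex comments, so the records remain searchable while the footer text does not. The page title, heading and footer wording are hypothetical.

```html
<HTML>
<HEAD>
<TITLE>Example County Marriages</TITLE>
</HEAD>
<BODY>
<H1>Marriages, 1850-1860</H1>
<P>Transcribed marriage records go here and are indexed as usual.

<!--htdig_noindex-->
<P>Copyright notice and webmaster name (hypothetical footer text, not indexed)
<!--/htdig_noindex-->
</BODY>
</HTML>
```

The ROBOTS META tag is left out here on purpose: NOINDEX would exclude the whole page, which would make the markers unnecessary.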