Building a Search Index

To build a search interface for your content, you first need to build an index file using the stork command-line program. If you haven't installed it yet, visit the installation instructions to get started.

The command-line program takes a configuration file and builds an index file. An index file is a serialized, compressed blob that the Stork JavaScript library downloads and parses. It usually has the .st file extension and is roughly the size of a JPEG.

The Configuration File

To build a search index, Stork requires that you put together a configuration file that describes which documents it should index. This configuration file is written in TOML.

The most basic Stork configuration specifies a list of files to be indexed and a filename to which the computed index will be written:

basic.toml
[input]
base_directory = "my_files"
url_prefix = "https://www.gutenberg.org/files/1404/1404-h/1404-h.htm#link2H_4_"
files = [
{path = "federalist-1.txt", url = "0001", title = "General Introduction"},
{path = "federalist-2.txt", url = "0002", title = "Concerning Dangers from Foreign Force and Influence"},
{path = "federalist-3.txt", url = "0003", title = "Concerning Dangers from Foreign Force and Influence 2"},
{path = "federalist-4.txt", url = "0004", title = "Concerning Dangers from Foreign Force and Influence 3"},
]
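
As a rough illustration of how url_prefix and each file's url fit together, the Python sketch below assumes the two are joined by plain string concatenation, which is what the values in basic.toml suggest. The result_url helper is hypothetical and is not part of Stork.

```python
# Hypothetical helper, not part of Stork: assumes a search result's URL
# is simply url_prefix followed by the file's url value.
def result_url(url_prefix: str, url: str) -> str:
    return url_prefix + url


URL_PREFIX = "https://www.gutenberg.org/files/1404/1404-h/1404-h.htm#link2H_4_"

# The url values of the four files listed in basic.toml.
for fragment in ["0001", "0002", "0003", "0004"]:
    print(result_url(URL_PREFIX, fragment))
```

Under that assumption, the prefix holds the part of the address shared by every document, and each file entry only needs the part that differs.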

To build a search index from this configuration file, save the file to disk and pass it into the Stork command-line tool:

$ stork build --input basic.toml --output federalist.st

The base directory is resolved relative to the working directory from which you run the stork command, and each file path is resolved relative to the base directory. For example, if you're in the ~/project/ directory when you run stork build --input basic.toml --output -, Stork will look for the files at:

  • ~/project/my_files/federalist-1.txt
  • ~/project/my_files/federalist-2.txt
  • ~/project/my_files/federalist-3.txt
  • ~/project/my_files/federalist-4.txt
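
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the path arithmetic, not Stork's own code; resolve_paths is a hypothetical name, and the sketch leaves ~ unexpanded so the output matches the list above.

```python
from pathlib import PurePosixPath


# Hypothetical helper, not part of Stork: joins the working directory,
# the base_directory value, and each file's path value, in that order.
def resolve_paths(working_dir: str, base_directory: str, file_paths: list) -> list:
    root = PurePosixPath(working_dir) / base_directory
    return [str(root / file_path) for file_path in file_paths]


paths = resolve_paths(
    "~/project",
    "my_files",
    ["federalist-1.txt", "federalist-2.txt", "federalist-3.txt"],
)
for resolved in paths:
    print(resolved)
```

Running the same command from a different directory changes the first component, so the files would need to exist relative to that directory instead.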

Testing your index

After writing a config file, you might want to test how well the search interface works before deploying the index to your website. Stork offers a test mode that builds your search index and loads it into a simplified web interface, so you can try out the search functionality while iterating on your configuration.

To test out your index, run:

$ stork test --config basic.toml
# or
$ stork test --index my-index.st

and open http://localhost:1612, the web page served by Stork.

Advanced file configuration

Each Stork file object accepts many parameters that change how the file is indexed. However, the syntax used in basic.toml above, an array of inline tables, requires each file object to be written on a single line. If a file object has a lot of parameters, this can get unwieldy.

TOML provides a second syntax, the array of tables, that can also be used to build up an array of file objects.

In this syntax, each object in the array is in its own multi-line block, and each parameter (and its value) is on its own line:

longer.toml
[input]
base_directory = "my_files"
url_prefix = "https://www.gutenberg.org/files/1404/1404-h/1404-h.htm#link2H_4_"
[[input.files]]
path = "federalist-1.txt"
url = "0001"
title = "General Introduction"
[[input.files]]
path = "federalist-2.txt"
url = "0002"
title = "Concerning Dangers from Foreign Force and Influence"
[[input.files]]
path = "federalist-3.txt"
url = "0003"
title = "Concerning Dangers from Foreign Force and Influence 2"
[[input.files]]
path = "federalist-4.txt"
url = "0004"
title = "Concerning Dangers from Foreign Force and Influence 3"

You might prefer this syntax for more complicated configurations.

File Formats

Stork's long-term goal is to let you bring your content in whatever format it already exists in, recognizing and indexing it with no preprocessing needed. Today, Stork can automatically recognize and extract text from four types of files:

  1. Plain text files,
  2. SRT subtitle files,
  3. HTML files, and
  4. Markdown files.

Stork detects each file's format from its file extension. If a file's extension is non-standard, you can specify its format explicitly with the filetype key:

[[input.files]]
path = "federalist-1.mdx"
url = "TheFederalistPapers-1"
title = "Introduction"
filetype = "Markdown"
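
The detection rule can be pictured as a lookup by extension, with an explicit filetype taking priority. The Python sketch below is an assumption-laden illustration: detect_filetype is a hypothetical name, the extension table is a guess at common extensions, and apart from "Markdown" the format names are placeholders rather than values confirmed on this page.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed extension table; the extensions Stork actually recognizes may differ.
# Only the "Markdown" value appears in the example above; the rest are placeholders.
EXTENSION_TO_FILETYPE = {
    ".txt": "PlainText",
    ".srt": "SRTSubtitle",
    ".html": "HTML",
    ".md": "Markdown",
}


def detect_filetype(path: str, filetype: Optional[str] = None) -> Optional[str]:
    """Return the explicit filetype if given, else guess from the extension."""
    if filetype is not None:
        return filetype
    extension = PurePosixPath(path).suffix.lower()
    # None means the extension is not in the table, so a filetype key is needed.
    return EXTENSION_TO_FILETYPE.get(extension)


print(detect_filetype("federalist-1.txt"))
print(detect_filetype("federalist-1.mdx"))
print(detect_filetype("federalist-1.mdx", filetype="Markdown"))
```

In this sketch the .mdx file is not recognized on its own, which is the situation the filetype key in the example above is meant to handle.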

If your files begin with front matter, you can tell Stork to leave it out of the index by adding the frontmatter_handling key to your configuration:

[input]
base_directory = "my_files"
url_prefix = "https://www.congress.gov/resources/display/content/The+Federalist+Papers#"
frontmatter_handling = "Omit"
files = [...]

You can also override frontmatter handling on a file-by-file basis; see the Configuration File Reference for details.
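
To make the effect of omitting front matter concrete, the Python sketch below strips a leading block delimited by --- lines before the text would be indexed. It assumes YAML-style delimiters and is not Stork's implementation; omit_frontmatter is a hypothetical name.

```python
# Hypothetical helper, not part of Stork: assumes front matter is a block
# at the very top of the file, opened and closed by lines containing "---".
def omit_frontmatter(text: str) -> str:
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return text  # no front matter: leave the text untouched
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "".join(lines[index + 1:])
    return text  # no closing delimiter: treat everything as content


document = (
    "---\n"
    "title: General Introduction\n"
    "---\n"
    "To the People of the State of New York:\n"
)
print(omit_frontmatter(document))
```

Without this step, metadata keys such as title would be indexed as if they were part of the document body.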

Additional Options

You can visit the Configuration File Reference to see the full list of acceptable configuration key-value pairs.

The Stork configuration file lets you control many subtle aspects of how Stork indexes your content and how the search interface behaves, such as:

  • Which language the stemmer uses to stem each word in your corpus
  • How much of a search rank boost the words in a title receive
  • How content previews are displayed in the web interface

© 2019–2022. Stork is maintained by James Little, who's really excited that you're checking it out.