Indexing Non-English Content

Stork is not an inherently English-language-specific tool. If your content is in a different language, Stork can parse and index that content.

The only language-specific functionality of Stork is the stemmer: the functionality that lets you search for the word "tributaries" and get results for the word "tributary." In this case, Stork's stemmer will recognize "tributar" as the root shared by both words and use that to improve search results.

This transformation from "tributaries" to "tributar" is English-specific. Stork supports stemming in the following languages:

  • Arabic
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Hungarian
  • Italian
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Tamil
  • Turkish

The stemming algorithm used for each file is determined by your Stork config file. If no language is specified, Stork defaults to English stemming. To change the global stemming algorithm, set the stemming key in the [input] section of your config file:

basic.toml (partial)
[input]
base_directory = "my_files/"
stemming = "Dutch"
files = [
...
]

All the files in your files array will be indexed using the Dutch stemming algorithm.

If you don't want the words in your content to be stemmed (or if it's in an unsupported language), you can also specify stemming = "None" in the configuration file to turn off stemming.
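As a minimal sketch of turning stemming off, the partial config below reuses the layout of the Dutch example. The base_directory value is a placeholder, and the file list is left out on the assumption that it stays the same as in your existing config.

```toml
[input]
base_directory = "my_files/"   # placeholder directory, as in the Dutch example
stemming = "None"              # disables stemming for every indexed file
# files = [ ... ]              # file list omitted; keep your existing entries
```

With stemming disabled, words are no longer reduced to a root, so a search for "tributaries" should not be expected to match "tributary" through stemming.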

The stemming configuration can also be set on a per-file basis with the stemming_override key. For example, if you have three files in Spanish but one in French, you can specify that the Spanish stemming algorithm be used generally, but that the French algorithm be used for the French-language file:

longer.toml (partial)
[input]
base_directory = "my_files"
stemming = "Spanish"
[[input.files]]
path = "document-1.txt"
url = "/document1.html"
title = "Mi primer documento"
[[input.files]]
path = "document-2.txt"
url = "/document2.html"
title = "Mon document en français"
stemming_override = "French"
[[input.files]]
path = "document-3.txt"
url = "/document3.html"
title = "Mi segundo documento"

© 2019–2021

Stork is built and shepherded by James Little, who's really excited that you're checking it out. If you have any questions or comments, feel free to tweet at him or open an issue on GitHub.
