Indexing Non-English Content

Stork is not inherently English-specific. If your content is in another language, Stork can still parse and index it.

The only language-specific part of Stork is the stemmer: the feature that lets you search for the word "tributaries" and get results for the word "tributary." In this case, Stork's stemmer reduces both words to the shared root "tributar" and matches on that root to improve search results.

This transformation from "tributaries" to "tributar" is English-specific. Stork supports stemming in the following languages:

Arabic
Danish
Dutch
English
Finnish
French
German
Greek
Hungarian
Italian
Norwegian
Portuguese
Romanian
Russian
Spanish
Swedish
Tamil
Turkish

The stemming algorithm used for each file is determined by your Stork config file. If no language is specified, Stork defaults to English stemming. To change the global stemming algorithm, set the stemming key in the [input] section of your config file:

basic.toml (partial)

[input]
base_directory = "my_files/"
stemming = "Dutch"
files = [
    ...
]

All the files in your files array will be indexed using the Dutch stemming algorithm.

If you don't want the words in your content to be stemmed, or if your content is in an unsupported language, you can specify stemming = "None" in the configuration file to turn off stemming.
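
As a minimal sketch of that option, the partial config below turns stemming off for every file. The file name no-stemming.toml is hypothetical, and base_directory and the elided files array are simply carried over from the earlier example:

no-stemming.toml (partial)

[input]
base_directory = "my_files/"
stemming = "None"   # turns off stemming for all files
files = [
    ...
]

Without stemming, different forms of the same word are no longer reduced to a shared root, so expect matching to depend more on the exact word forms in your content.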

The stemming configuration can also be set on a per-file basis. For example, if you have three files in Spanish and one in French, you can set Spanish as the general stemming algorithm and use stemming_override to apply the French algorithm to the French-language file:

longer.toml (partial)

[input]
base_directory = "my_files"
stemming = "Spanish"
 
[[input.files]] 
path = "document-1.txt"
url = "/document1.html"
title = "Mi primer documento"
 
[[input.files]]
path = "document-2.txt"
url = "/document2.html"
title = "Mon document en français"
stemming_override = "French"
 
[[input.files]]
path = "document-3.txt"
url = "/document3.html"
title = "Mi segundo documento"
