OCR languages
Audiveris delegates text recognition to Tesseract OCR library.
Whether you install Audiveris via its Windows installer or download the project and build it locally from source, you will need to have a local copy of some Tesseract language files:
eng
(English) is mandatory,deu
(German),fra
(French),ita
(Italian) are often useful.
Note that you can always download additional languages later from the dedicated Tesseract tessdata site which contains data files for 100+ languages.
Table of contents
Tesseract as a linked library
Audiveris calls Tesseract software as a linked binary library, not as a separate executable program.
- Tesseract library is automatically provided via Audiveris installation or building. As of this writing, Audiveris uses version 5.3.1 of Tesseract library.
- There is thus no need to install any Tesseract program. You may already have a Tesseract program of whatever version installed on our machine, Audiveris will not interfere with that program.
- However, Tesseract library will need data (language files) which must be provided separately.
Data version
Tesseract OCR engine can operate in two different OCR modes – the legacy
mode and the new LSTM
mode – each with its own model.
To process text scattered among musical symbols, Audiveris must use the legacy
mode.
The language files downloadable from Tesseract tessdata page are meant for Tesseract version 4.x and up, each language file containing both legacy
and LSTM
models.
So, these are the language data files that Audiveris requires.
Data location
At starting time, Audiveris tries to initialize the Tesseract library with a tessdata
folder:
- It first checks the location defined by the
TESSDATA_PREFIX
environment variable. - If not found there, it tries the Tesseract tessdata default location according to the OS, which for Windows can be for example
"C:\Program Files\tesseract-ocr\tessdata"
.
If in doubt, we recommend the following actions:
- Choose or create a specific folder, named
tessdata
for clarity. - Download a few language files (at least
eng.traineddata
file) from Tesseract tessdata page to your specific folder. - Define the
TESSDATA_PREFIX
environment variable to point to your specific folder.
For illustration purpose, here is a personal configuration:
- I have created a
"tessdata"
sub-folder in Audiveris userconfig
folder.$ echo $TESSDATA_PREFIX C:\Users\herve\AppData\Roaming\AudiverisLtd\audiveris\config\tessdata
- And downloaded just four files from Tesseract tessdata web page into it.
$ ls -lgh $TESSDATA_PREFIX total 66M -rw-r--r-- 1 herve 15M Jun 22 14:15 deu.traineddata -rw-r--r-- 1 herve 23M Jun 22 12:18 eng.traineddata -rw-r--r-- 1 herve 14M Jun 22 14:16 fra.traineddata -rw-r--r-- 1 herve 16M Jun 22 14:16 ita.traineddata
The About dialog, launched from the Help | About
pulldown menu, displays key information about the OCR engine version and OCR tessdata folder:
Languages selection
At runtime, you can specify which languages should be tried by the OCR software.
This is done via a language specification string, a plus-separated list of language names:
-
The easiest way is to define this language specification interactively.
Using theBook | Set Book Parameters
menu, you can make specifications at the global level, book level and even individual sheet level.
Depending upon the language files present in your localtessdata
folder, you will be presented with the list of languages available for selection. -
The default (global) specification is determined by the application constant
org.audiveris.omr.text.Language.defaultSpecification
, whose initial value isdeu+eng+fra
.
Thus, you can also modify this default directly by changing the constant value:- either interactively (using
Tools | Options
menu) - or in batch (using something like
-option org.audiveris.omr.text.Language.defaultSpecification=ita+eng
).
- either interactively (using