Differences
This page shows the differences between two versions of the page.
Previous revision on both sides Previous revision Next revision | Previous revision | ||
wiki:tesseract [2012/07/20 10:31] [Description] fixed a typo |
wiki:tesseract [2017/03/22 20:56] |
||
---|---|---|---|
Line 1: | Line 1: | ||
======== Tesseract ======== | ======== Tesseract ======== | ||
+ | ''tesseract'' - a command-line OCR engine. | ||
- | ''tesseract'' - a command-line OCR engine | + | ==== Description ==== | ||
+ | ''Tesseract'' is a high-quality, open-source command-line OCR engine. It currently works with UTF-8; language support (including Russian since version 3.0) is provided by additional modules. | ||
+ | |||
+ | Several graphical front ends (GUIs) exist for the program: //gImageReader, OCRFeeder, YAGF//. | ||
==== Syntax ==== | ==== Syntax ==== | ||
<code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code> | <code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code> | ||
- | ==== Description ==== | + | ==== Options ==== | ||
+ | <code bash>imagename</code> | ||
+ | The name of the input image. Most image file formats (anything readable by Leptonica) are supported. | ||
- | ''tesseract(1)'' is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995 this engine was among the top three evaluated by UNLV. HP and UNLV released the source code in 2005, and Google has developed it since. | + | <code bash>outbase</code> | ||
- | ((''tesseract(1)'' is a commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed at Google since then.)) | + | The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt. |
- | ==== OPTIONS ==== | + | |
- | imagename | + | |
- | The name of the input image. Most image file formats (anything | + | |
- | readable by Leptonica) are supported. | + | |
- | outbase | + | <code bash>-l lang</code> |
- | The basename of the output file (to which the appropriate extension | + | |
- | will be appended). By default the output will be named outbase.txt. | + | |
- | -l lang | + | The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) |
- | The language to use. If none is specified, English is assumed. | + | |
- | Multiple languages may be specified, separated by plus characters. | + | |
- | Tesseract uses 3-character ISO 639-2 language codes. (See | + | |
- | LANGUAGES) | + | |
- | -psm N | + | <code bash>-psm N</code> |
- | Set Tesseract to only run a subset of layout analysis and assume a | + | Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are: |
- | certain form of image. The options for N are: | + | |
- | 0 = Orientation and script detection (OSD) only. | + | * 0 = Orientation and script detection (OSD) only. |
- | 1 = Automatic page segmentation with OSD. | + | * 1 = Automatic page segmentation with OSD. |
- | 2 = Automatic page segmentation, but no OSD, or OCR. | + | * 2 = Automatic page segmentation, but no OSD, or OCR. |
- | 3 = Fully automatic page segmentation, but no OSD. (Default) | + | * 3 = Fully automatic page segmentation, but no OSD. (Default) |
- | 4 = Assume a single column of text of variable sizes. | + | * 4 = Assume a single column of text of variable sizes. |
- | 5 = Assume a single uniform block of vertically aligned text. | + | * 5 = Assume a single uniform block of vertically aligned text. |
- | 6 = Assume a single uniform block of text. | + | * 6 = Assume a single uniform block of text. |
- | 7 = Treat the image as a single text line. | + | * 7 = Treat the image as a single text line. |
- | 8 = Treat the image as a single word. | + | * 8 = Treat the image as a single word. |
- | 9 = Treat the image as a single word in a circle. | + | * 9 = Treat the image as a single word in a circle. |
- | 10 = Treat the image as a single character. | + | * 10 = Treat the image as a single character. |
- | -v | + | <code bash>-v</code> |
- | Returns the current version of the tesseract(1) executable. | + | Returns the current version of the tesseract(1) executable. |
- | configfile | + | <code bash>configfile</code> |
- | The name of a config to use. A config is a plaintext file which | + | |
- | contains a list of variables and their values, one per line, with a | + | |
- | space separating variable from value. Interesting config files | + | |
- | include: | + | |
- | o hocr - Output in hOCR format instead of as a text file. | + | The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value. Interesting config files include: |
- | Nota Bene: The options -l lang and -psm N must occur before any | + | * hocr - Output in hOCR format instead of as a text file. | ||
- | configfile. | + | |
+ | <note>Nota Bene: The options -l lang and -psm N must occur before any configfile.</note> | ||
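The ordering rule in the note can be illustrated with a small shell sketch. The helper name ''build_tesseract_cmd'' and the file names are hypothetical and not part of Tesseract; the function only prints a command line in the documented order and does not run the engine.

```bash
# Hypothetical helper: print a tesseract command line in the documented order.
# Usage: build_tesseract_cmd IMAGE OUTBASE [LANGS] [PSM] [CONFIGFILE...]
build_tesseract_cmd() {
    image=$1; outbase=$2; langs=${3:-}; psm=${4:-}
    # Drop the four leading arguments (or all of them if fewer were given);
    # anything left over is treated as a config file name.
    if [ "$#" -gt 4 ]; then shift 4; else shift "$#"; fi
    cmd="tesseract $image $outbase"
    # -l and -psm go first, because they must occur before any configfile.
    if [ -n "$langs" ]; then cmd="$cmd -l $langs"; fi
    if [ -n "$psm" ]; then cmd="$cmd -psm $psm"; fi
    for cfg in "$@"; do cmd="$cmd $cfg"; done
    printf '%s\n' "$cmd"
}

# Russian plus English, a single uniform block of text, hOCR output.
build_tesseract_cmd page.png page rus+eng 6 hocr
# prints: tesseract page.png page -l rus+eng -psm 6 hocr
```

The sketch joins arguments with spaces, so it assumes file names without whitespace.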
==== Languages ==== | ==== Languages ==== | ||
- | There are currently language packs available for the following | + | There are currently language packs available for the following languages: |
- | languages: | + | |
- | ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces | + | - ara (Arabic), | ||
- | (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), | + | - aze (Azerbaijani), | ||
- | chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu | + | - bul (Bulgarian), |
- | (German), ell (Greek), eng (English), enm (Old English), epo | + | - cat (Catalan), |
- | (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old | + | - ces (Czech), |
- | French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun | + | - chi_sim (Simplified Chinese), | ||
- | (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor | + | - chi_tra (Traditional Chinese), |
- | (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor | + | - chr (Cherokee), |
- | (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus | + | - dan (Danish), |
- | (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa | + | - dan-frak (Danish (Fraktur)), |
- | (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl | + | - deu (German), |
- | (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese) | + | - ell (Greek), |
+ | - eng (English), | ||
+ | - enm (Old English), | ||
+ | - epo (Esperanto), | ||
+ | - est (Estonian), | ||
+ | - fin (Finnish), | ||
+ | - fra (French), | ||
+ | - frm (Old French), | ||
+ | - glg (Galician), | ||
+ | - heb (Hebrew), | ||
+ | - hin (Hindi), | ||
+ | - hrv (Croatian), | ||
+ | - hun (Hungarian), | ||
+ | - ind (Indonesian), | ||
+ | - ita (Italian), | ||
+ | - jpn (Japanese), | ||
+ | - kor (Korean), | ||
+ | - lav (Latvian), | ||
+ | - lit (Lithuanian), | ||
+ | - nld (Dutch), | ||
+ | - nor (Norwegian), | ||
+ | - pol (Polish), | ||
+ | - por (Portuguese), | ||
+ | - ron (Romanian), | ||
+ | - rus (Russian), | ||
+ | - slk (Slovakian), | ||
+ | - slv (Slovenian), | ||
+ | - sqi (Albanian), | ||
+ | - spa (Spanish), | ||
+ | - srp (Serbian), | ||
+ | - swe (Swedish), | ||
+ | - tam (Tamil), | ||
+ | - tel (Telugu), | ||
+ | - tgl (Tagalog), | ||
+ | - tha (Thai), | ||
+ | - tur (Turkish), | ||
+ | - ukr (Ukrainian), | ||
+ | - vie (Vietnamese) | ||
- | To use a non-standard language pack named foo.traineddata, set the | + | To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo. |
- | TESSDATA_PREFIX environment variable so the file can be found at | + | |
- | TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the | + | |
- | argument -l foo. | + | |
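A minimal sketch of the lookup described above, assuming a language pack stored under a directory of your choosing. The helper ''traineddata_path'', the ''$HOME/ocr'' location and the file names are hypothetical; only the ''TESSDATA_PREFIX/tessdata/foo.traineddata'' layout and the ''-l foo'' argument come from the text.

```bash
# Hypothetical helper: print the path where the language pack is expected.
# $1 = language name (e.g. foo), $2 = prefix (defaults to $TESSDATA_PREFIX)
traineddata_path() {
    prefix=${2:-${TESSDATA_PREFIX:-}}
    # Strip one trailing slash so the result has no doubled separator.
    printf '%s/tessdata/%s.traineddata\n' "${prefix%/}" "$1"
}

# Assumed layout: $HOME/ocr/tessdata/foo.traineddata
TESSDATA_PREFIX="$HOME/ocr"
export TESSDATA_PREFIX

pack=$(traineddata_path foo)
if [ -f "$pack" ]; then
    # The pack is where Tesseract will look for it, so -l foo should resolve.
    tesseract page.png page -l foo
else
    echo "language pack not found: $pack" >&2
fi
```

Checking for the file first gives a clearer message than letting the engine fail on a missing pack.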
==== History ==== | ==== History ==== | ||
+ | ''Tesseract'' was developed by HP between 1985 and 1995 and then went unchanged for ten years. Its source code was opened in 2005. Google has sponsored development of the engine since 2006. | ||
+ | |||
The engine was developed at Hewlett Packard Laboratories Bristol and at | The engine was developed at Hewlett Packard Laboratories Bristol and at | ||
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some | Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some | ||
Line 109: | Line 135: | ||
==== Resources ==== | ==== Resources ==== | ||
- | Main web site: http://code.google.com/p/tesseract-ocr/ Information on | + | * Project site: https://github.com/tesseract-ocr | ||
- | training: | + | * Documentation: https://github.com/tesseract-ocr/tesseract/wiki | ||
- | http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 | + | * Wikipedia: https://ru.wikipedia.org/wiki/Tesseract | ||
==== See also ==== | ==== See also ==== | ||
ambiguous_words(1), cntraining(1), combine_tessdata(1), | ambiguous_words(1), cntraining(1), combine_tessdata(1), | ||
Line 129: | Line 154: | ||
Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus. | Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus. | ||
- | ==== COPYING ==== | + | ==== Copying ==== | ||
- | Licensed under the Apache License, Version 2.0 | + | |
+ | Licensed under the //Apache License, Version 2.0// | ||
{{tag>tesseract}} | {{tag>tesseract}} |