Differences
This page shows the differences between two versions of the page.
| | wiki:tesseract [2012/07/20 10:30] [DESCRIPTION] translation | | wiki:tesseract [2017/03/22 20:56] (current) |
|---|---|---|---|
| Line 1: | Line 1: | ||
| ======== Tesseract ======== | ======== Tesseract ======== | ||
| + | ''tesseract'' - a command-line OCR engine. | ||
| - | ''tesseract'' - a command-line OCR engine | + | ==== Description ==== |
| + | ''Tesseract'' is a high-quality, open-source command-line OCR engine. The program currently works with UTF-8; language support (including Russian since version 3.0) is provided by additional modules. | ||
| + | |||
| + | Several graphical front ends (GUIs) exist for the program: //gImageReader, OCRFeeder, YAGF//. | ||
| ==== Syntax ==== | ==== Syntax ==== | ||
| <code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code> | <code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code> | ||
| - | ==== Description ==== | + | ==== Options ==== |
| + | <code bash>imagename</code> | ||
| + | The name of the input image. Most image file formats (anything readable by Leptonica) are supported. | ||
| - | ''tesseract(1)'' is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. The source code was opened by HP and UNLV in 2005, and Google has been developing it since then. | + | <code bash>outbase</code> |
| - | ((''tesseract(1)'' is a commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed at Google since then.)) | + | The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt. |
| - | ==== OPTIONS ==== | + | |
| - | imagename | + | |
| - | The name of the input image. Most image file formats (anything | + | |
| - | readable by Leptonica) are supported. | + | |
| - | outbase | + | <code bash>-l lang</code> |
| - | The basename of the output file (to which the appropriate extension | + | |
| - | will be appended). By default the output will be named outbase.txt. | + | |
| - | -l lang | + | The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) |
| - | The language to use. If none is specified, English is assumed. | + | |
| - | Multiple languages may be specified, separated by plus characters. | + | |
| - | Tesseract uses 3-character ISO 639-2 language codes. (See | + | |
| - | LANGUAGES) | + | |
| - | -psm N | + | <code bash>-psm N</code> |
| - | Set Tesseract to only run a subset of layout analysis and assume a | + | Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are: |
| - | certain form of image. The options for N are: | + | |
| - | 0 = Orientation and script detection (OSD) only. | + | * 0 = Orientation and script detection (OSD) only. |
| - | 1 = Automatic page segmentation with OSD. | + | * 1 = Automatic page segmentation with OSD. |
| - | 2 = Automatic page segmentation, but no OSD, or OCR. | + | * 2 = Automatic page segmentation, but no OSD, or OCR. |
| - | 3 = Fully automatic page segmentation, but no OSD. (Default) | + | * 3 = Fully automatic page segmentation, but no OSD. (Default) |
| - | 4 = Assume a single column of text of variable sizes. | + | * 4 = Assume a single column of text of variable sizes. |
| - | 5 = Assume a single uniform block of vertically aligned text. | + | * 5 = Assume a single uniform block of vertically aligned text. |
| - | 6 = Assume a single uniform block of text. | + | * 6 = Assume a single uniform block of text. |
| - | 7 = Treat the image as a single text line. | + | * 7 = Treat the image as a single text line. |
| - | 8 = Treat the image as a single word. | + | * 8 = Treat the image as a single word. |
| - | 9 = Treat the image as a single word in a circle. | + | * 9 = Treat the image as a single word in a circle. |
| - | 10 = Treat the image as a single character. | + | * 10 = Treat the image as a single character. |
| - | -v | + | <code bash>-v</code> |
| - | Returns the current version of the tesseract(1) executable. | + | Returns the current version of the tesseract(1) executable. |
| - | configfile | + | <code bash>configfile</code> |
| - | The name of a config to use. A config is a plaintext file which | + | |
| - | contains a list of variables and their values, one per line, with a | + | |
| - | space separating variable from value. Interesting config files | + | |
| - | include: | + | |
| - | o hocr - Output in hOCR format instead of as a text file. | + | The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value. Interesting config files include: |
| - | Nota Bene: The options -l lang and -psm N must occur before any | + | * hocr - Output in hOCR format instead of as a text file. |
| - | configfile. | + | |
| + | <note>Nota Bene: The options -l lang and -psm N must occur before any configfile.</note> | ||
| ==== Languages ==== | ==== Languages ==== | ||
| - | There are currently language packs available for the following | + | There are currently language packs available for the following languages: |
| - | languages: | + | |
| - | ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces | + | - ara (Arabic), | ||
| - | (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), | + | - aze (Azerbaijani), |
| - | chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu | + | - bul (Bulgarian), |
| - | (German), ell (Greek), eng (English), enm (Old English), epo | + | - cat (Catalan), |
| - | (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old | + | - ces (Czech), |
| - | French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun | + | - chi_sim (Simplified Chinese), |
| - | (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor | + | - chi_tra (Traditional Chinese), |
| - | (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor | + | - chr (Cherokee), |
| - | (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus | + | - dan (Danish), |
| - | (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa | + | - dan-frak (Danish (Fraktur)), |
| - | (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl | + | - deu (German), |
| - | (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese) | + | - ell (Greek), |
| + | - eng (English), | ||
| + | - enm (Old English), | ||
| + | - epo (Esperanto), | ||
| + | - est (Estonian), | ||
| + | - fin (Finnish), | ||
| + | - fra (French), | ||
| + | - frm (Old French), | ||
| + | - glg (Galician), | ||
| + | - heb (Hebrew), | ||
| + | - hin (Hindi), | ||
| + | - hrv (Croatian), | ||
| + | - hun (Hungarian), | ||
| + | - ind (Indonesian), | ||
| + | - ita (Italian), | ||
| + | - jpn (Japanese), | ||
| + | - kor (Korean), | ||
| + | - lav (Latvian), | ||
| + | - lit (Lithuanian), | ||
| + | - nld (Dutch), | ||
| + | - nor (Norwegian), | ||
| + | - pol (Polish), | ||
| + | - por (Portuguese), | ||
| + | - ron (Romanian), | ||
| + | - rus (Russian), | ||
| + | - slk (Slovakian), | ||
| + | - slv (Slovenian), | ||
| + | - sqi (Albanian), | ||
| + | - spa (Spanish), | ||
| + | - srp (Serbian), | ||
| + | - swe (Swedish), | ||
| + | - tam (Tamil), | ||
| + | - tel (Telugu), | ||
| + | - tgl (Tagalog), | ||
| + | - tha (Thai), | ||
| + | - tur (Turkish), | ||
| + | - ukr (Ukrainian), | ||
| + | - vie (Vietnamese) | ||
| - | To use a non-standard language pack named foo.traineddata, set the | + | To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo. |
| - | TESSDATA_PREFIX environment variable so the file can be found at | + | |
| - | TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the | + | |
| - | argument -l foo. | + | |
| ==== History ==== | ==== History ==== | ||
| + | ''Tesseract'' was developed by HP between 1985 and 1995 and then remained unchanged for ten years. The source code was opened in 2005. Since 2006, Google has sponsored development of the engine. | ||
| + | |||
| The engine was developed at Hewlett Packard Laboratories Bristol and at | The engine was developed at Hewlett Packard Laboratories Bristol and at | ||
| Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some | Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some | ||
| Line 109: | Line 135: | ||
| ==== Resources ==== | ==== Resources ==== | ||
| - | Main web site: http://code.google.com/p/tesseract-ocr/ Information on | + | * Project site: https://github.com/tesseract-ocr |
| - | training: | + | * Documentation: https://github.com/tesseract-ocr/tesseract/wiki |
| - | http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 | + | * Wikipedia: https://ru.wikipedia.org/wiki/Tesseract |
| ==== See also ==== | ==== See also ==== | ||
| ambiguous_words(1), cntraining(1), combine_tessdata(1), | ambiguous_words(1), cntraining(1), combine_tessdata(1), | ||
| Line 129: | Line 154: | ||
| Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus. | Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus. | ||
| - | ==== COPYING ==== | + | ==== Copying ==== |
| - | Licensed under the Apache License, Version 2.0 | + | |
| + | Licensed under the //Apache License, Version 2.0// | ||
| {{tag>tesseract}} | {{tag>tesseract}} | ||
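
The ordering rules from the Syntax and Options sections (image and output base first, then ''-l'' and ''-psm'', then any config files, with several languages joined by plus characters) can be illustrated with a small shell sketch. The helper name ''tesseract_cmd'' and the file names are hypothetical and not part of Tesseract; the function only prints a command line and does not run the engine. It assumes arguments without spaces.

```bash
# Hypothetical helper (not part of Tesseract): prints a tesseract command
# line with the options in the order the synopsis requires.
# Usage: tesseract_cmd IMAGE OUTBASE LANGS PSM [CONFIGFILE ...]
#   LANGS  space-separated language codes, e.g. "rus eng"; "" = default (English)
#   PSM    page segmentation mode 0-10; "" = default (3)
# All four leading arguments must be given (pass "" to skip LANGS or PSM).
tesseract_cmd() (
    image=$1
    outbase=$2
    langs=$3
    psm=$4
    shift 4

    cmd="tesseract $image $outbase"

    if [ -n "$langs" ]; then
        # Multiple languages are joined with plus characters: "rus eng" -> rus+eng
        cmd="$cmd -l $(printf '%s' "$langs" | tr -s ' ' '+')"
    fi

    if [ -n "$psm" ]; then
        cmd="$cmd -psm $psm"
    fi

    # Config files always go last, after -l and -psm.
    for cfg in "$@"; do
        cmd="$cmd $cfg"
    done

    printf '%s\n' "$cmd"
)

# Example: Russian plus English, single uniform block of text, hOCR output.
tesseract_cmd scan.png out "rus eng" 6 hocr
# prints: tesseract scan.png out -l rus+eng -psm 6 hocr
```

Printing the command instead of executing it keeps the sketch usable on a machine without Tesseract installed; the printed line can be reviewed and then run by hand.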