Differences
This page shows the differences between two versions of the page.
wiki:tesseract [2012/07/20 10:13] created |
wiki:tesseract [2017/03/22 20:49] [Description] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Tesseract ====== | + | ======== Tesseract ======== |
- | tesseract - a command-line OCR engine | + | ''tesseract'' - a command-line OCR engine |
- | === Syntax === | + | ==== Syntax ==== |
- | <code bash> | + | <code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code> |
- | tesseract imagename outbase [-l lang] [-psm N] [configfile ...] | + | |
- | </code> | + | |
- | === DESCRIPTION === | + | ==== Description ==== |
- | tesseract(1) is a commercial quality OCR engine originally developed at | + | |
- | HP between 1985 and 1995. In 1995, this engine was among the top 3 | + | |
- | evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has | + | |
- | been developed at Google since then. | + | |
- | === OPTIONS === | + | ''Tesseract'' is a high-quality open-source OCR engine. The program currently works with UTF-8; language support (including Russian, since version 3.0) is provided by additional modules. |
- | imagename | + | |
- | The name of the input image. Most image file formats (anything | + | |
- | readable by Leptonica) are supported. | + | |
- | outbase | + | ''Tesseract'' was developed by HP between 1985 and 1995 and then went unchanged for ten years. The source code was opened in 2005. Since 2006, development of the engine has been sponsored by Google. |
- | The basename of the output file (to which the appropriate extension | + | ==== Options ==== |
- | will be appended). By default the output will be named outbase.txt. | + | <code bash>imagename</code> |
+ | The name of the input image. Most image file formats (anything readable by Leptonica) are supported. | ||
- | -l lang | + | <code bash>outbase</code> |
- | The language to use. If none is specified, English is assumed. | + | The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt. |
- | Multiple languages may be specified, separated by plus characters. | + | |
- | Tesseract uses 3-character ISO 639-2 language codes. (See | + | |
- | LANGUAGES) | + | |
- | -psm N | + | <code bash>-l lang</code> |
- | Set Tesseract to only run a subset of layout analysis and assume a | + | |
- | certain form of image. The options for N are: | + | |
- | 0 = Orientation and script detection (OSD) only. | + | The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) |
- | 1 = Automatic page segmentation with OSD. | + | |
- | 2 = Automatic page segmentation, but no OSD, or OCR. | + | |
- | 3 = Fully automatic page segmentation, but no OSD. (Default) | + | |
- | 4 = Assume a single column of text of variable sizes. | + | |
- | 5 = Assume a single uniform block of vertically aligned text. | + | |
- | 6 = Assume a single uniform block of text. | + | |
- | 7 = Treat the image as a single text line. | + | |
- | 8 = Treat the image as a single word. | + | |
- | 9 = Treat the image as a single word in a circle. | + | |
- | 10 = Treat the image as a single character. | + | |
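The ''-l'' description says that several languages are joined with plus characters. A minimal sketch of building that argument in the shell follows; the helper name ''join_langs'' and the file names ''page.png'' and ''out'' are assumptions made for illustration. The command line is only printed, not executed.

```shell
# join_langs is a hypothetical helper (not part of tesseract):
# it joins its arguments with '+' to form the value of the -l option.
# The body runs in a subshell, so changing IFS does not leak out.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

langs=$(join_langs eng rus)

# page.png and out are placeholder names; the command is shown, not run.
echo "tesseract page.png out -l $langs"
```

With the codes ''eng'' and ''rus'' the printed command asks for English and Russian together.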
- | -v | + | <code bash>-psm N</code> |
- | Returns the current version of the tesseract(1) executable. | + | Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are: |
- | configfile | + | * 0 = Orientation and script detection (OSD) only. |
- | The name of a config to use. A config is a plaintext file which | + | * 1 = Automatic page segmentation with OSD. |
- | contains a list of variables and their values, one per line, with a | + | * 2 = Automatic page segmentation, but no OSD, or OCR. |
- | space separating variable from value. Interesting config files | + | * 3 = Fully automatic page segmentation, but no OSD. (Default) |
- | include: | + | * 4 = Assume a single column of text of variable sizes. |
+ | * 5 = Assume a single uniform block of vertically aligned text. | ||
+ | * 6 = Assume a single uniform block of text. | ||
+ | * 7 = Treat the image as a single text line. | ||
+ | * 8 = Treat the image as a single word. | ||
+ | * 9 = Treat the image as a single word in a circle. | ||
+ | * 10 = Treat the image as a single character. | ||
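The segmentation modes above can be picked by the shape of the input image. The sketch below maps a few descriptive keywords to the documented values of N. The function name ''psm_for'', its keywords, and the file names are assumptions for illustration only, and the ''-psm'' spelling follows the syntax shown on this page, which may differ in other tesseract versions. The command is printed rather than run.

```shell
# psm_for is a hypothetical helper: it maps an image kind to the
# page segmentation mode number taken from the list on this page.
psm_for() {
    case "$1" in
        column) echo 4 ;;   # single column of text of variable sizes
        block)  echo 6 ;;   # single uniform block of text
        line)   echo 7 ;;   # single text line
        word)   echo 8 ;;   # single word
        char)   echo 10 ;;  # single character
        *)      echo 3 ;;   # fully automatic segmentation, no OSD (default)
    esac
}

# line.png and out are placeholder names; the command is shown, not run.
echo "tesseract line.png out -psm $(psm_for line)"
```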
- | o hocr - Output in hOCR format instead of as a text file. | + | <code bash>-v</code> |
+ | Returns the current version of the tesseract(1) executable. | ||
- | Nota Bene: The options -l lang and -psm N must occur before any | + | <code bash>configfile</code> |
- | configfile. | + | |
- | === Languages === | + | The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value. Interesting config files include: |
- | There are currently language packs available for the following | + | |
- | languages: | + | |
- | ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces | + | o hocr - Output in hOCR format instead of as a text file. |
- | (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), | + | |
- | chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu | + | |
- | (German), ell (Greek), eng (English), enm (Old English), epo | + | |
- | (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old | + | |
- | French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun | + | |
- | (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor | + | |
- | (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor | + | |
- | (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus | + | |
- | (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa | + | |
- | (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl | + | |
- | (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese) | + | |
- | To use a non-standard language pack named foo.traineddata, set the | + | <note>Nota Bene: The options -l lang and -psm N must occur before any configfile.</note> |
- | TESSDATA_PREFIX environment variable so the file can be found at | + | ==== Languages ==== |
- | TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the | + | There are currently language packs available for the following languages: |
- | argument -l foo. | + | |
- | === History === | + | - ara (Arabic), |
+ | - aze (Azerbaijani), | ||
+ | - bul (Bulgarian), | ||
+ | - cat (Catalan), | ||
+ | - ces (Czech), | ||
+ | - chi_sim (Simplified Chinese), | ||
+ | - chi_tra (Traditional Chinese), | ||
+ | - chr (Cherokee), | ||
+ | - dan (Danish), | ||
+ | - dan-frak (Danish (Fraktur)), | ||
+ | - deu (German), | ||
+ | - ell (Greek), | ||
+ | - eng (English), | ||
+ | - enm (Old English), | ||
+ | - epo (Esperanto), | ||
+ | - est (Estonian), | ||
+ | - fin (Finnish), | ||
+ | - fra (French), | ||
+ | - frm (Old French), | ||
+ | - glg (Galician), | ||
+ | - heb (Hebrew), | ||
+ | - hin (Hindi), | ||
+ | - hrv (Croatian), | ||
+ | - hun (Hungarian), | ||
+ | - ind (Indonesian), | ||
+ | - ita (Italian), | ||
+ | - jpn (Japanese), | ||
+ | - kor (Korean), | ||
+ | - lav (Latvian), | ||
+ | - lit (Lithuanian), | ||
+ | - nld (Dutch), | ||
+ | - nor (Norwegian), | ||
+ | - pol (Polish), | ||
+ | - por (Portuguese), | ||
+ | - ron (Romanian), | ||
+ | - rus (Russian), | ||
+ | - slk (Slovakian), | ||
+ | - slv (Slovenian), | ||
+ | - sqi (Albanian), | ||
+ | - spa (Spanish), | ||
+ | - srp (Serbian), | ||
+ | - swe (Swedish), | ||
+ | - tam (Tamil), | ||
+ | - tel (Telugu), | ||
+ | - tgl (Tagalog), | ||
+ | - tha (Thai), | ||
+ | - tur (Turkish), | ||
+ | - ukr (Ukrainian), | ||
+ | - vie (Vietnamese) | ||
+ | |||
+ | To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo. | ||
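To make the ''TESSDATA_PREFIX'' rule concrete, here is a small sketch that computes where ''foo.traineddata'' has to be located for a given prefix and checks whether it is there. The helper name ''traineddata_path'' and the directory ''$HOME/ocr'' are assumptions for illustration. Tesseract itself is not invoked.

```shell
# traineddata_path is a hypothetical helper: given a prefix directory and a
# language name, it prints PREFIX/tessdata/LANG.traineddata, the location
# described on this page. A trailing slash on the prefix is dropped.
traineddata_path() {
    printf '%s/tessdata/%s.traineddata\n' "${1%/}" "$2"
}

# $HOME/ocr is an assumed location for the custom language pack.
TESSDATA_PREFIX="$HOME/ocr"
export TESSDATA_PREFIX

pack=$(traineddata_path "$TESSDATA_PREFIX" foo)
if [ -f "$pack" ]; then
    echo "language pack found: $pack"
else
    echo "language pack missing: $pack"
fi

# page.png and out are placeholder names; the command is shown, not run.
echo "tesseract page.png out -l foo"
```

Checking for the file before running the OCR step is a design choice of this sketch, so that a misplaced pack is reported by the script itself.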
+ | |||
+ | ==== History ==== | ||
The engine was developed at Hewlett Packard Laboratories Bristol and at | The engine was developed at Hewlett Packard Laboratories Bristol and at | ||
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some | Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some | ||
Line 112: | Line 133: | ||
distribution. | distribution. | ||
- | === Resources === | + | ==== Resources ==== |
- | Main web site: http://code.google.com/p/tesseract-ocr/ Information on | + | * Project site: https://github.com/tesseract-ocr |
- | training: | + | * Documentation: https://github.com/tesseract-ocr/tesseract/wiki |
- | http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 | + | * Wikipedia: https://ru.wikipedia.org/wiki/Tesseract |
- | + | ==== See also ==== |
- | === See also === | + | |
ambiguous_words(1), cntraining(1), combine_tessdata(1), | ambiguous_words(1), cntraining(1), combine_tessdata(1), | ||
dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5), | dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5), | ||
unicharset(5), unicharset_extractor(1), wordlist2dawg(1) | unicharset(5), unicharset_extractor(1), wordlist2dawg(1) | ||
- | === Author === | + | ==== Authors ==== |
- | Tesseract development was led at Hewlett-Packard and Google by Ray | + | |
- | Smith. The development team has included: | + | Development of ''tesseract'' was led at Hewlett-Packard and Google by Ray Smith. The development team has included: |
Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David | Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David | ||
Line 133: | Line 153: | ||
Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus. | Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus. | ||
- | === COPYING === | + | ==== Copying ==== |
- | Licensed under the Apache License, Version 2.0 | + | |
+ | Licensed under the //Apache License, Version 2.0// | ||
{{tag>tesseract}} | {{tag>tesseract}} |