Tesseract: Version comparison

Differences

This page shows the differences between two versions of the page.

wiki:tesseract [2012/07/20 10:24]
[Author]
wiki:tesseract [2017/03/22 20:56] (current)
Line 1: Line 1:
 ======== Tesseract ======== ======== Tesseract ========
 +''tesseract'' - a command-line OCR engine.
  
-''tesseract'' - command-line OCR engine
+==== Description ====
 +''Tesseract'' is a high-quality, open-source command-line OCR engine. The program currently works with UTF-8; language support (including Russian since version 3.0) is provided through additional modules.
 +
 +Several graphical front ends (GUIs) exist for the program: //gImageReader, OCRFeeder, YAGF//.
  
 ==== Syntax ==== ==== Syntax ====
-<code bash>
-       tesseract imagename outbase [-l lang] [-psm N] [configfile ...]
-       </code>
+<code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code>
  
-==== DESCRIPTION ====
-       tesseract(1) is a commercial quality OCR engine originally developed at
-       HP between 1985 and 1995. In 1995, this engine was among the top 3
-       evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has
-       been developed at Google since then.
+==== Options ====
+<code bash>imagename</code>
+The name of the input image. Most image file formats (anything readable by Leptonica) are supported.
  
-==== OPTIONS ====
-       imagename
-           The name of the input image. Most image file formats (anything
-           readable by Leptonica) are supported.
+<code bash>outbase</code>
+The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt.
  
-       outbase
-           The basename of the output file (to which the appropriate extension
-           will be appended). By default the output will be named outbase.txt.
+<code bash>-l lang</code>
  
-       -l lang +The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
-           The language to use. If none is specified, English is assumed. +
-           Multiple languages may be specified, separated by plus characters. +
-           Tesseract uses 3-character ISO 639-2 language codes. (See +
-           LANGUAGES)+
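
A minimal sketch of building the multi-language value for ''-l''. The helper ''join_langs'' is hypothetical (it is not part of tesseract), and ''scan.png'' / ''result'' are placeholder names. The command is only printed, not run.

```shell
# join_langs is a hypothetical helper, not part of tesseract.
# It joins its arguments with '+', the separator the -l option expects.
# The function body runs in a subshell so the IFS change stays local.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

langs=$(join_langs rus eng)

# Print the command instead of running it (scan.png and result are placeholders).
echo "tesseract scan.png result -l $langs"
```

Printing the command first makes it easy to check the language string before OCR is actually started.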
  
-       -psm N
-           Set Tesseract to only run a subset of layout analysis and assume a
-           certain form of image. The options for N are:
+<code bash>-psm N</code>
+Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
  
-               0 = Orientation and script detection (OSD) only.
-               1 = Automatic page segmentation with OSD.
-               2 = Automatic page segmentation, but no OSD, or OCR.
-               3 = Fully automatic page segmentation, but no OSD. (Default)
-               4 = Assume a single column of text of variable sizes.
-               5 = Assume a single uniform block of vertically aligned text.
-               6 = Assume a single uniform block of text.
-               7 = Treat the image as a single text line.
-               8 = Treat the image as a single word.
-               9 = Treat the image as a single word in a circle.
-               10 = Treat the image as a single character.
+  * 0 = Orientation and script detection (OSD) only.
+  * 1 = Automatic page segmentation with OSD.
+  * 2 = Automatic page segmentation, but no OSD, or OCR.
+  * 3 = Fully automatic page segmentation, but no OSD. (Default)
+  * 4 = Assume a single column of text of variable sizes.
+  * 5 = Assume a single uniform block of vertically aligned text.
+  * 6 = Assume a single uniform block of text.
+  * 7 = Treat the image as a single text line.
+  * 8 = Treat the image as a single word.
+  * 9 = Treat the image as a single word in a circle.
+  * 10 = Treat the image as a single character.
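
The mode numbers above are easy to mix up, so a script can map a descriptive name to the value of N. This is only a sketch: ''psm_for'' and its layout names are hypothetical, the numbers come from the list above, and the command is printed rather than run.

```shell
# psm_for is a hypothetical helper: it maps a layout name (our own naming)
# to the -psm value from the list above.
psm_for() {
    case "$1" in
        page)   echo 3 ;;   # fully automatic page segmentation (default)
        column) echo 4 ;;   # single column of text of variable sizes
        block)  echo 6 ;;   # single uniform block of text
        line)   echo 7 ;;   # single text line
        word)   echo 8 ;;   # single word
        char)   echo 10 ;;  # single character
        *)      echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# line.png and out are placeholder names; the command is only printed.
echo "tesseract line.png out -psm $(psm_for line)"
```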
  
-       -v
-           Returns the current version of the tesseract(1) executable.
+<code bash>-v</code>
+Returns the current version of the tesseract(1) executable.
  
-       configfile
-           The name of a config to use. A config is a plaintext file which
-           contains a list of variables and their values, one per line, with a
-           space separating variable from value. Interesting config files
-           include:
+<code bash>configfile</code>
 
-           o   hocr - Output in hOCR format instead of as text file.
+The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value. Interesting config files include:
 
-       Nota Bene: The options -l lang and -psm N must occur before any
-       configfile.
+   o   hocr - Output in hOCR format instead of as a text file.
  
 +<​note>​Nota Bene: The options -l lang and -psm N must occur before any configfile.</​note>​
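
The ordering rule in the note can be enforced by assembling the command in one place. In the sketch below, ''build_cmd'' is a hypothetical helper and ''page.png'' / ''page'' are placeholder names. It always puts ''-l'' and ''-psm'' before any config files and prints the resulting command line without running it.

```shell
# build_cmd IMAGE OUTBASE LANG PSM [CONFIG...]
# Hypothetical helper: prints a tesseract command line with -l and -psm
# placed before any config files, as the note requires.
build_cmd() (
    image=$1 outbase=$2 lang=$3 psm=$4
    shift 4
    cmd="tesseract $image $outbase -l $lang -psm $psm"
    # Remaining arguments are config file names; they always go last.
    for cfg in "$@"; do
        cmd="$cmd $cfg"
    done
    printf '%s\n' "$cmd"
)

# Request hOCR output for a Russian page (placeholder file names).
build_cmd page.png page rus 3 hocr
```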
 ==== Languages ==== ==== Languages ====
-       There are currently language packs available for the following
-       languages:
+There are currently language packs available for the following languages:
  
-       ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces
-       (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese),
-       chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu
-       (German), ell (Greek), eng (English), enm (Old English), epo
-       (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old
-       French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun
-       (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor
-       (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor
-       (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus
-       (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa
-       (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl
-       (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese)
+   - ara (Arabic),
+   - aze (Azerbaijani),
+   - bul (Bulgarian),
+   - cat (Catalan),
+   - ces (Czech),
+   - chi_sim (Simplified Chinese),
+   - chi_tra (Traditional Chinese),
+   - chr (Cherokee),
+   - dan (Danish),
+   - dan-frak (Danish (Fraktur)),
+   - deu (German),
+   - ell (Greek),
+   - eng (English),
+   - enm (Old English),
+   - epo (Esperanto),
+   - est (Estonian),
+   - fin (Finnish),
+   - fra (French),
+   - frm (Old French),
+   - glg (Galician),
+   - heb (Hebrew),
+   - hin (Hindi),
+   - hrv (Croatian),
+   - hun (Hungarian),
+   - ind (Indonesian),
+   - ita (Italian),
+   - jpn (Japanese),
+   - kor (Korean),
+   - lav (Latvian),
+   - lit (Lithuanian),
+   - nld (Dutch),
+   - nor (Norwegian),
+   - pol (Polish),
+   - por (Portuguese),
+   - ron (Romanian),
+   - rus (Russian),
+   - slk (Slovakian),
+   - slv (Slovenian),
+   - sqi (Albanian),
+   - spa (Spanish),
+   - srp (Serbian),
+   - swe (Swedish),
+   - tam (Tamil),
+   - tel (Telugu),
+   - tgl (Tagalog),
+   - tha (Thai),
+   - tur (Turkish),
+   - ukr (Ukrainian),
+   - vie (Vietnamese)
  
-       To use a non-standard language pack named foo.traineddata, set the
-       TESSDATA_PREFIX environment variable so the file can be found at
-       TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the
-       argument -l foo.
+To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo.
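
A small sketch of the lookup rule described above, assuming the layout ''TESSDATA_PREFIX/tessdata/NAME.traineddata'' exactly as this page states it. The helper ''traineddata_path'' and the directory ''/opt/ocr'' are hypothetical. The script only checks whether the file exists and prints a suggested command.

```shell
# traineddata_path is a hypothetical helper: it prints where a language pack
# is expected to live, following the layout described above.
traineddata_path() {
    # ${TESSDATA_PREFIX%/} drops one trailing slash, if present.
    printf '%s/tessdata/%s.traineddata\n' "${TESSDATA_PREFIX%/}" "$1"
}

TESSDATA_PREFIX=/opt/ocr    # hypothetical directory, adjust as needed
export TESSDATA_PREFIX

pack=$(traineddata_path foo)
if [ -f "$pack" ]; then
    echo "found $pack; run: tesseract scan.png result -l foo"
else
    echo "missing $pack"
fi
```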
  
  ==== History ====   ==== History ====
 +''Tesseract'' was developed by HP between 1985 and 1995 and then went unchanged for ten years. Its source code was opened in 2005. Since 2006, development of the engine has been sponsored by Google.
 +
        The engine was developed at Hewlett Packard Laboratories Bristol and at        The engine was developed at Hewlett Packard Laboratories Bristol and at
        ​Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some        ​Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
Line 113: Line 135:
  
  ==== Resources ====   ==== Resources ====
-       Main web site: http://code.google.com/p/tesseract-ocr/ Information on
-       training:
-       http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
+  * Project site: https://github.com/tesseract-ocr
+  * Documentation: https://github.com/tesseract-ocr/tesseract/wiki
+  * Wikipedia: https://ru.wikipedia.org/wiki/Tesseract
  ==== See also ====   ==== See also ====
        ​ambiguous_words(1),​ cntraining(1),​ combine_tessdata(1),​        ​ambiguous_words(1),​ cntraining(1),​ combine_tessdata(1),​
Line 133: Line 154:
        ​Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus.        ​Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus.
  
-==== COPYING ====
-       Licensed under the Apache License, Version 2.0
+==== Copying ====
 
 +Licensed under the //Apache License, Version 2.0//
  
  
  
 {{tag>​tesseract}} {{tag>​tesseract}}