Tesseract: version comparison

Differences

This page shows the differences between two versions of the page.

wiki:tesseract [2012/07/20 10:13]
Oleg Orlov created
wiki:tesseract [2017/03/22 20:56] (current)
1234567890
Line 1: Line 1:
-====== Tesseract ======
+======== Tesseract ========
+''tesseract'' is a command-line OCR engine.

-tesseract is a command-line OCR engine
-
-=== Syntax ===
-<code bash>
-       tesseract imagename outbase [-l lang] [-psm N] [configfile ...]
-       </code>
-
-=== DESCRIPTION ===
-       tesseract(1) is a commercial quality OCR engine originally developed at
-       HP between 1985 and 1995. In 1995, this engine was among the top 3
-       evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has
-       been developed at Google since then.
+==== Description ====
+''Tesseract'' is a high-quality, open-source command-line OCR engine. It currently works with UTF-8; language support (including Russian since version 3.0) is provided by additional modules.
+
+Several graphical front ends (GUIs) exist for the program: //gImageReader, OCRFeeder, YAGF//.
+
+==== Syntax ====
+<code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code>
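As a rough illustration of the synopsis, the sketch below assembles a command line in the documented argument order (image, output base, then ''-l'' and ''-psm'', then an optional config file) and prints it instead of running it. The helper name ''build_ocr_cmd'' and the file names are invented for this example, and the helper does not handle file names containing spaces.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a tesseract command line.
# Usage: build_ocr_cmd IMAGE OUTBASE [LANG] [PSM] [CONFIGFILE]
build_ocr_cmd() {
    image=$1
    outbase=$2
    lang=${3:-eng}   # English is the documented default language
    psm=${4:-3}      # 3 is the documented default segmentation mode
    config=${5:-}    # optional config file; it must come after -l and -psm

    cmd="tesseract $image $outbase -l $lang -psm $psm"
    if [ -n "$config" ]; then
        cmd="$cmd $config"
    fi
    printf '%s\n' "$cmd"
}

# Two languages joined with a plus sign, single-block mode, hOCR output.
build_ocr_cmd scan.png result rus+eng 6 hocr
```

Printing the command first makes it easy to check the option order before running anything on real images.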
  
-=== OPTIONS ===
+==== Options ====
-       imagename
-           The name of the input image. Most image file formats (anything
-           readable by Leptonica) are supported.
+<code bash>imagename</code>
+The name of the input image. Most image file formats (anything readable by Leptonica) are supported.

-       outbase
-           The basename of the output file (to which the appropriate extension
-           will be appended). By default the output will be named outbase.txt.
+<code bash>outbase</code>
+The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt.

-       -l lang
-           The language to use. If none is specified, English is assumed.
-           Multiple languages may be specified, separated by plus characters.
-           Tesseract uses 3-character ISO 639-2 language codes. (See
-           LANGUAGES)
+<code bash>-l lang</code>
+
+The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)

-       -psm N
-           Set Tesseract to only run a subset of layout analysis and assume a
-           certain form of image. The options for N are:
-
-               0 = Orientation and script detection (OSD) only.
-               1 = Automatic page segmentation with OSD.
-               2 = Automatic page segmentation, but no OSD, or OCR.
-               3 = Fully automatic page segmentation, but no OSD. (Default)
-               4 = Assume a single column of text of variable sizes.
-               5 = Assume a single uniform block of vertically aligned text.
-               6 = Assume a single uniform block of text.
-               7 = Treat the image as a single text line.
-               8 = Treat the image as a single word.
-               9 = Treat the image as a single word in a circle.
-               10 = Treat the image as a single character.
-
-       -v
-           Returns the current version of the tesseract(1) executable.
+<code bash>-psm N</code>
+Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
+
+  *    0 = Orientation and script detection (OSD) only.
+  *    1 = Automatic page segmentation with OSD.
+  *    2 = Automatic page segmentation, but no OSD, or OCR.
+  *    3 = Fully automatic page segmentation, but no OSD. (Default)
+  *    4 = Assume a single column of text of variable sizes.
+  *    5 = Assume a single uniform block of vertically aligned text.
+  *    6 = Assume a single uniform block of text.
+  *    7 = Treat the image as a single text line.
+  *    8 = Treat the image as a single word.
+  *    9 = Treat the image as a single word in a circle.
+  *    10 = Treat the image as a single character.
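Since the list above defines modes 0 through 10 only, a wrapper script might reject anything else before calling the engine. The following is a minimal sketch of such a check; the function name ''is_valid_psm'' is hypothetical and not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the argument is one of the
# documented -psm modes, i.e. an integer from 0 to 10.
is_valid_psm() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -ge 0 ] && [ "$1" -le 10 ]
}

# Try a few candidate values and report which ones would be accepted.
for mode in 3 7 10 11 auto; do
    if is_valid_psm "$mode"; then
        echo "$mode: valid"
    else
        echo "$mode: invalid"
    fi
done
```

The check only confirms that a value is in the documented range; choosing the right mode for a given image (for example 7 for a single text line) remains up to the caller.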
  
-       configfile
-           The name of a config to use. A config is a plaintext file which
-           contains a list of variables and their values, one per line, with a
-           space separating variable from value. Interesting config files
-           include:
-
-           o   hocr - Output in hOCR format instead of as a text file.
-
-       Nota Bene: The options -l lang and -psm N must occur before any
-       configfile.
+<code bash>-v</code>
+Returns the current version of the tesseract(1) executable.
+
+<code bash>configfile</code>
+
+The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value. Interesting config files include:
+
+   o   hocr - Output in hOCR format instead of as a text file.
+
+<note>Nota Bene: The options -l lang and -psm N must occur before any configfile.</note>

-=== Languages ===
-       There are currently language packs available for the following
-       languages:
-
-       ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces
-       (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese),
-       chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu
-       (German), ell (Greek), eng (English), enm (Old English), epo
-       (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old
-       French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun
-       (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor
-       (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor
-       (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus
-       (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa
-       (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl
-       (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese)
-
-       To use a non-standard language pack named foo.traineddata, set the
-       TESSDATA_PREFIX environment variable so the file can be found at
-       TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the
-       argument -l foo.
+==== Languages ====
+There are currently language packs available for the following languages:
+
+   - ara (Arabic),
+   - aze (Azerbaijani),
+   - bul (Bulgarian),
+   - cat (Catalan),
+   - ces (Czech),
+   - chi_sim (Simplified Chinese),
+   - chi_tra (Traditional Chinese),
+   - chr (Cherokee),
+   - dan (Danish),
+   - dan-frak (Danish (Fraktur)),
+   - deu (German),
+   - ell (Greek),
+   - eng (English),
+   - enm (Old English),
+   - epo (Esperanto),
+   - est (Estonian),
+   - fin (Finnish),
+   - fra (French),
+   - frm (Old French),
+   - glg (Galician),
+   - heb (Hebrew),
+   - hin (Hindi),
+   - hrv (Croatian),
+   - hun (Hungarian),
+   - ind (Indonesian),
+   - ita (Italian),
+   - jpn (Japanese),
+   - kor (Korean),
+   - lav (Latvian),
+   - lit (Lithuanian),
+   - nld (Dutch),
+   - nor (Norwegian),
+   - pol (Polish),
+   - por (Portuguese),
+   - ron (Romanian),
+   - rus (Russian),
+   - slk (Slovakian),
+   - slv (Slovenian),
+   - sqi (Albanian),
+   - spa (Spanish),
+   - srp (Serbian),
+   - swe (Swedish),
+   - tam (Tamil),
+   - tel (Telugu),
+   - tgl (Tagalog),
+   - tha (Thai),
+   - tur (Turkish),
+   - ukr (Ukrainian),
+   - vie (Vietnamese)
+
+To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo.
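The path rule for non-standard language packs can be sketched as a small helper that prints where the file is expected and checks whether it exists. The function name ''traineddata_path'' and the ''/opt/ocr'' directory are assumptions made for illustration; only the ''TESSDATA_PREFIX/tessdata/foo.traineddata'' layout comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper: print the path where a language pack is expected,
# following the TESSDATA_PREFIX/tessdata/<lang>.traineddata layout.
traineddata_path() {
    prefix=${TESSDATA_PREFIX%/}   # drop one trailing slash, if any
    printf '%s/tessdata/%s.traineddata\n' "$prefix" "$1"
}

# /opt/ocr is an arbitrary example location, not a Tesseract default.
TESSDATA_PREFIX=/opt/ocr
export TESSDATA_PREFIX

pack=$(traineddata_path foo)
if [ -f "$pack" ]; then
    echo "found $pack; run: tesseract scan.png result -l foo"
else
    echo "missing $pack"
fi
```

Checking for the file up front separates a misplaced language pack from other causes of a failed run.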
 + 
- === History ===
+ ==== History ====
+''Tesseract'' was developed by HP between 1985 and 1995 and then went unchanged for ten years. Its source code was opened in 2005, and Google has sponsored development of the engine since 2006.

        The engine was developed at Hewlett Packard Laboratories Bristol and at
        Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
Line 112: Line 134:
        distribution.
  
- === Resources ===
-       Main web site: http://code.google.com/p/tesseract-ocr/ Information on
-       training:
-       http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
-
- === See also ===
+ ==== Resources ====
+  * Project site: https://github.com/tesseract-ocr
+  * Documentation: https://github.com/tesseract-ocr/tesseract/wiki
+  * Wikipedia: https://ru.wikipedia.org/wiki/Tesseract
+ ==== See also ====
        ambiguous_words(1), cntraining(1), combine_tessdata(1),
        dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5),
        unicharset(5), unicharset_extractor(1), wordlist2dawg(1)
  
- === Author ===
-       Tesseract development was led at Hewlett-Packard and Google by Ray
-       Smith. The development team has included:
+ ==== Authors ====
+
+Development of ''tesseract'' was led by Hewlett-Packard and by Ray Smith of Google. The development team consists of: ((Tesseract development was led at Hewlett-Packard and Google by Ray Smith. The development team has included:))
  
        Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David
Line 133: Line 154:
        Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus.
  
-=== COPYING ===
-       Licensed under the Apache License, Version 2.0
+==== Copying ====
+
+Licensed under the //Apache License, Version 2.0//
  
  
  
 {{tag>tesseract}}