Differences

This page shows the differences between two revisions of the page.

--- wiki:tesseract [2012/07/20 10:13]
created
+++ wiki:tesseract [2017/03/22 20:56] (current)
@@ Line 1: / Line 1: @@
-====== Tesseract ======
+======== Tesseract ========
+''tesseract'' is a command-line OCR engine.
-tesseract is a command-line OCR engine
+==== Description ====
+''Tesseract'' is a high-quality, open-source command-line OCR engine. The program currently works with UTF-8; language support (including Russian since version 3.0) is provided by additional modules.
-=== Syntax ===
+Several graphical front ends (GUIs) exist for the program: //gImageReader, OCRFeeder, YAGF//.
-<code bash>
-       tesseract imagename outbase [-l lang] [-psm N] [configfile ...]
-       </code>
-=== DESCRIPTION ===
+==== Syntax ====
-       tesseract(1) is a commercial quality OCR engine originally developed at
+<code bash>tesseract imagename outbase [-l lang] [-psm N] [configfile ...]</code>
-       HP between 1985 and 1995. In 1995, this engine was among the top 3
-       evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has
-       been developed at Google since then.
-=== OPTIONS ===
+==== Options ====
-       imagename
+<code bash>imagename</code>
-           The name of the input image. Most image file formats (anything
+The name of the input image. Most image file formats (anything readable by Leptonica) are supported.
-           readable by Leptonica) are supported.
-       outbase
+<code bash>outbase</code>
-           The basename of the output file (to which the appropriate extension
+The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt.
-           will be appended). By default the output will be named outbase.txt.
-       -l lang
+<code bash>-l lang</code>
-           The language to use. If none is specified, English is assumed.
-           Multiple languages may be specified, separated by plus characters.
-           Tesseract uses 3-character ISO 639-2 language codes. (See
-           LANGUAGES)
-       -psm N
+The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
-           Set Tesseract to only run a subset of layout analysis and assume a
-           certain form of image. The options for N are:
-= Orientation and script detection (OSD) only.
+<code bash>-psm N</code>
-= Automatic page segmentation with OSD.
+Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
-= Automatic page segmentation, but no OSD, or OCR.
-= Fully automatic page segmentation, but no OSD. (Default)
-= Assume a single column of text of variable sizes.
-= Assume a single uniform block of vertically aligned text.
-= Assume a single uniform block of text.
-= Treat the image as a single text line.
-= Treat the image as a single word.
-= Treat the image as a single word in a circle.
-= Treat the image as a single character.
-       -v
+  *    0 = Orientation and script detection (OSD) only.
-           Returns the current version of the tesseract(1) executable.
+  *    1 = Automatic page segmentation with OSD.
+  *    2 = Automatic page segmentation, but no OSD, or OCR.
+  *    3 = Fully automatic page segmentation, but no OSD. (Default)
+  *    4 = Assume a single column of text of variable sizes.
+  *    5 = Assume a single uniform block of vertically aligned text.
+  *    6 = Assume a single uniform block of text.
+  *    7 = Treat the image as a single text line.
+  *    8 = Treat the image as a single word.
+  *    9 = Treat the image as a single word in a circle.
+  *    10 = Treat the image as a single character.
-       configfile
+<code bash>-v</code>
-           The name of a config to use. A config is a plaintext file which
+Returns the current version of the tesseract(1) executable.
-           contains a list of variables and their values, one per line, with a
-           space separating variable from value. Interesting config files
-           include:
-           o   hocr - Output in hOCR format instead of as a text file.
+<code bash>configfile</code>
-       Nota Bene: The options -l lang and -psm N must occur before any
+The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value. Interesting config files include:
-       configfile.
-=== Languages ===
+   o   hocr - Output in hOCR format instead of as a text file.
-       There are currently language packs available for the following
-       languages:
-       ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces
+<note>Nota Bene: The options -l lang and -psm N must occur before any configfile.</note>
-       (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese),
+==== Languages ====
-       chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu
+There are currently language packs available for the following languages:
-       (German), ell (Greek), eng (English), enm (Old English), epo
-       (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old
-       French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun
-       (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor
-       (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor
-       (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus
-       (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa
-       (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl
-       (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese)
-       To use a non-standard language pack named foo.traineddata, set the
+   -  ara (Arabic),
-       TESSDATA_PREFIX environment variable so the file can be found at
+   -  aze (Azerbaijani),
-       TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the
+   -  bul (Bulgarian),
-       argument -l foo.
+   - cat (Catalan),
+   -  ces (Czech),
+   -  chi_sim (Simplified Chinese),
+   -  chi_tra (Traditional Chinese),
+   -  chr (Cherokee),
+   -  dan (Danish),
+   -  dan-frak (Danish (Fraktur)),
+   -  deu (German),
+   - ell (Greek),
+   - eng (English),
+   - enm (Old English),
+   -  epo (Esperanto),
+   -  est (Estonian),
+   - fin (Finnish),
+   -  fra (French),
+   -  frm (Old French),
+   - glg (Galician),
+   - heb (Hebrew),
+   -  hin (Hindi),
+   -  hrv (Croatian),
+   -  hun (Hungarian),
+   - ind (Indonesian),
+   - ita (Italian),
+   - jpn (Japanese),
+   -  kor (Korean),
+   -  lav (Latvian),
+   -  lit (Lithuanian),
+   -  nld (Dutch),
+   - nor (Norwegian),
+   - pol (Polish),
+   -  por (Portuguese),
+   -  ron (Romanian),
+   -  rus (Russian),
+   -  slk (Slovakian),
+   - slv (Slovenian),
+   - sqi (Albanian),
+   -  spa (Spanish),
+   -  srp (Serbian),
+   -  swe (Swedish),
+   - tam (Tamil),
+   -  tel (Telugu),
+   -  tgl (Tagalog),
+   -  tha (Thai),
+   - tur (Turkish),
+   - ukr (Ukrainian),
+   -  vie (Vietnamese)
+To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo.
+ ==== History ====
+''Tesseract'' was developed by HP between 1985 and 1995 and then remained unchanged for ten years. The source code was opened in 2005. Since 2006, development of the engine has been sponsored by Google.
- === History ===
        The engine was developed at Hewlett Packard Laboratories Bristol and at
        Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
@@ Line 112: / Line 134: @@
        distribution.
- === Resources ===
+ ==== Resources ====
-       Main web site: http://code.google.com/p/tesseract-ocr/ Information on
+  * Project site: https://github.com/tesseract-ocr
-       training:
+  * Documentation: https://github.com/tesseract-ocr/tesseract/wiki
-       http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
+  * Wikipedia: https://ru.wikipedia.org/wiki/Tesseract
+ ==== See also ====
- === See also ===
        ambiguous_words(1), cntraining(1), combine_tessdata(1),
        dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5),
        unicharset(5), unicharset_extractor(1), wordlist2dawg(1)
- === Author ===
+ ==== Authors ====
-       Tesseract development was led at Hewlett-Packard and Google by Ray
-       Smith. The development team has included:
+Development of ''tesseract'' was led at Hewlett-Packard and Google by Ray Smith. The development team has included:
        Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David
@@ Line 133: / Line 154: @@
        Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus.
-=== COPYING ===
+==== Copying ====
-       Licensed under the Apache License, Version 2.0
+Licensed under the //Apache License, Version 2.0//
 {{tag>tesseract}}
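
The option-order note and the TESSDATA_PREFIX rule shown in the diff above can be illustrated with a small dry-run sketch. It only prints what would be run and does not call ''tesseract'' itself. The function names ''build_tesseract_cmd'' and ''traineddata_path'', the file names, and the ''/opt/ocr'' prefix are hypothetical and exist only for this example; the ''-psm'' spelling is taken from the synopsis on this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): prints a tesseract command
# line with -l and -psm placed before any config file, as the note requires.
build_tesseract_cmd() {
    image=$1; outbase=$2; lang=$3; psm=$4
    shift 4
    # Options come first; the remaining arguments are config file names.
    printf 'tesseract %s %s -l %s -psm %s' "$image" "$outbase" "$lang" "$psm"
    for cfg in "$@"; do
        printf ' %s' "$cfg"
    done
    printf '\n'
}

# Hypothetical helper: the location a language pack must have so that
# "-l <lang>" can find it under a given TESSDATA_PREFIX.
traineddata_path() {
    printf '%s/tessdata/%s.traineddata\n' "$1" "$2"
}

# Example values are assumptions: two languages joined with "+",
# page segmentation mode 6, and the hocr config file.
build_tesseract_cmd page.png page rus+eng 6 hocr
traineddata_path /opt/ocr foo
```

Printing the command rather than executing it keeps the sketch independent of the installed version; once the output looks right, the printed line can be run by hand.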
