DRY Notes (DRYな備忘録)

Don't Repeat Yourself.

Compiling Tesseract-OCR from Source

tesseract-ocr

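I want to confirm that I can compile Tesseract-OCR and produce the .so files that get loaded as shared libraries. The API files (.h and so on) are bundled in the archives at Releases · tesseract-ocr/tesseract · GitHub, so unpacking one is enough to get them. While I'm at it, I also want to confirm that the resulting Tesseract-OCR actually runs in the same environment.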

Prerequisite: Personal notes on quickly spinning up a throwaway dev environment with Docker - DRYな備忘録

References

Log

root@f456604ccbed:/# cd
root@f456604ccbed:~# mkdir workspace && cd workspace
root@f456604ccbed:~/workspace#
root@f456604ccbed:~/workspace# wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
root@f456604ccbed:~/workspace# tar -zxvf 3.04.01.tar.gz
root@f456604ccbed:~/workspace# cd tesseract-3.04.01/
root@f456604ccbed:~/workspace/tesseract-3.04.01#
root@f456604ccbed:~/workspace/tesseract-3.04.01# ./autogen.sh
Running aclocal
./autogen.sh: 60: ./autogen.sh: aclocal: not found

  Something went wrong, bailing out!

root@f456604ccbed:~/workspace/tesseract-3.04.01# apt-get install -y autotools-dev
root@f456604ccbed:~/workspace/tesseract-3.04.01# apt-get install -y automake
root@f456604ccbed:~/workspace/tesseract-3.04.01# ./autogen.sh
Running aclocal
Running libtoolize
./autogen.sh: 65: ./autogen.sh: libtoolize: not found
./autogen.sh: 65: ./autogen.sh: glibtoolize: not found

  Something went wrong, bailing out!

root@f456604ccbed:~/workspace/tesseract-3.04.01# apt-get install -y build-essential libtool
root@f456604ccbed:~/workspace/tesseract-3.04.01# ./autogen.sh
Running aclocal
Running libtoolize
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `config`.
libtoolize: copying file `config/ltmain.sh`
libtoolize: putting macros in AC_CONFIG_MACRO_DIR, `m4`.
libtoolize: copying file `m4/libtool.m4`
libtoolize: copying file `m4/ltoptions.m4`
libtoolize: copying file `m4/ltsugar.m4`
libtoolize: copying file `m4/ltversion.m4`
libtoolize: copying file `m4/lt~obsolete.m4`
Running autoheader
Running automake --add-missing --copy
configure.ac:321: installing 'config/compile'
Running autoconf

All done.
To build the software now, do something like:

$ ./configure [--enable-debug] [...other options]
root@f456604ccbed:~/workspace/tesseract-3.04.01#

autogen.sh succeeded.
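Both failed runs above came from commands that autogen.sh calls but that were not installed yet. A minimal sketch of a pre-flight check, assuming the five commands that autogen.sh announced in the log; the function name `missing_tools` is hypothetical and not part of Tesseract:

```shell
#!/bin/sh
# Hypothetical helper: print each command from the argument list
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands that autogen.sh reported running in the log above.
missing=$(missing_tools aclocal libtoolize autoheader automake autoconf)
if [ -n "$missing" ]; then
    echo "missing before ./autogen.sh:"
    echo "$missing"
else
    echo "all autotools commands found"
fi
```

Running this before ./autogen.sh reports every missing command in one pass instead of one per failed run.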

root@f456604ccbed:~/workspace/tesseract-3.04.01# ./configure
# (snip)
checking for leptonica... configure: error: leptonica not found
root@f456604ccbed:~/workspace/tesseract-3.04.01# cd ..
root@f456604ccbed:~/workspace# wget https://github.com/DanBloomberg/leptonica/archive/v1.73.tar.gz
root@f456604ccbed:~/workspace# tar -zxvf v1.73.tar.gz
root@f456604ccbed:~/workspace# cd leptonica-1.73/
root@f456604ccbed:~/workspace/leptonica-1.73# ./configure
bash: ./configure: Permission denied
root@f456604ccbed:~/workspace/leptonica-1.73# chmod 755 ./configure
root@f456604ccbed:~/workspace/leptonica-1.73#
root@f456604ccbed:~/workspace/leptonica-1.73# ./configure
root@f456604ccbed:~/workspace/leptonica-1.73# make

root@f456604ccbed:~/workspace/leptonica-1.73# make install
Making install in src
make[1]: Entering directory '/root/workspace/leptonica-1.73/src'
make[2]: Entering directory '/root/workspace/leptonica-1.73/src'
test -z "/usr/local/lib" || /bin/mkdir -p "/usr/local/lib"
# (snip)
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR`
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH` environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH` environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR` linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf`

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
# (rest omitted)
root@f456604ccbed:~/workspace/leptonica-1.73# ls -l /usr/local/lib/
total 22156
-rw-r--r-- 1 root staff 14116202 Nov  6 20:35 liblept.a
-rwxr-xr-x 1 root staff      943 Nov  6 20:35 liblept.la
lrwxrwxrwx 1 root staff       16 Nov  6 20:35 liblept.so -> liblept.so.5.0.0
lrwxrwxrwx 1 root staff       16 Nov  6 20:35 liblept.so.5 -> liblept.so.5.0.0
-rwxr-xr-x 1 root staff  8559120 Nov  6 20:35 liblept.so.5.0.0
drwxr-sr-x 2 root staff     4096 Nov  6 20:35 pkgconfig
root@f456604ccbed:~/workspace/leptonica-1.73#

Leptonica compiled successfully.

root@f456604ccbed:~/workspace/leptonica-1.73# cd ../tesseract-3.04.01/
root@f456604ccbed:~/workspace/tesseract-3.04.01# export LIBLEPT_HEADERSDIR=/root/workspace/leptonica-1.73/src
root@f456604ccbed:~/workspace/tesseract-3.04.01# ./configure
# (snip)

Configuration is done.
You can now build and install tesseract by running:

$ make
$ sudo make install

You can not build training tools because of missing dependency.
Check configure output for details.

It grumbles something about the training tools, but Tesseract's configure has completed.

root@f456604ccbed:~/workspace/tesseract-3.04.01# make
root@f456604ccbed:~/workspace/tesseract-3.04.01# make install
# (snip)
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR`
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH` environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH` environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR` linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf`

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

Tesseract's make and make install are done too. Checking the result:

root@f456604ccbed:~/workspace/tesseract-3.04.01# cd
root@f456604ccbed:~# ls -l /usr/local/lib/
total 137364
-rw-r--r-- 1 root staff 14116202 Nov  6 20:35 liblept.a
-rwxr-xr-x 1 root staff      943 Nov  6 20:35 liblept.la
lrwxrwxrwx 1 root staff       16 Nov  6 20:35 liblept.so -> liblept.so.5.0.0
lrwxrwxrwx 1 root staff       16 Nov  6 20:35 liblept.so.5 -> liblept.so.5.0.0
-rwxr-xr-x 1 root staff  8559120 Nov  6 20:35 liblept.so.5.0.0
-rw-r--r-- 1 root staff 87030250 Nov  6 20:45 libtesseract.a
-rwxr-xr-x 1 root staff      987 Nov  6 20:45 libtesseract.la
lrwxrwxrwx 1 root staff       21 Nov  6 20:45 libtesseract.so -> libtesseract.so.3.0.4
lrwxrwxrwx 1 root staff       21 Nov  6 20:45 libtesseract.so.3 -> libtesseract.so.3.0.4
-rwxr-xr-x 1 root staff 30937064 Nov  6 20:45 libtesseract.so.3.0.4
drwxr-sr-x 2 root staff     4096 Nov  6 20:45 pkgconfig
root@f456604ccbed:~# which tesseract
/usr/local/bin/tesseract
root@f456604ccbed:~#

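The tesseract command itself will probably fail because there is no traineddata yet. The goal this time, though, was to obtain the tesseract/leptonica header files and compiled .so files without using the OS package manager, so I think that goal has been met for now.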
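The `ls -l` check above can also be scripted. A minimal sketch, assuming the install prefix /usr/local/lib shown in the log; the helper name `has_shared_lib` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR contains lib<NAME>.so, the
# unversioned symlink that the linker looks for with -l<NAME>.
has_shared_lib() {
    lib_dir=$1
    lib_name=$2
    [ -e "$lib_dir/lib$lib_name.so" ]
}

# The two libraries installed in the log above.
for lib in lept tesseract; do
    if has_shared_lib /usr/local/lib "$lib"; then
        echo "found lib$lib.so"
    else
        echo "lib$lib.so not found in /usr/local/lib"
    fi
done
```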

Extra: checking how the tesseract command behaves

root@f456604ccbed:~# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
root@f456604ccbed:~# tesseract --list-langs
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract could not load any languages!
Could not initialize tesseract.
root@f456604ccbed:~#

As expected, it complains that eng.traineddata is missing.
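A side note on the export at the top of that log: when LD_LIBRARY_PATH starts out empty, `$LD_LIBRARY_PATH:/usr/local/lib` expands to `:/usr/local/lib`, and the dynamic loader treats the empty entry as the current directory. A minimal sketch of a safer append; the helper name `append_path` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print CURRENT with DIR appended. The separator is
# skipped when CURRENT is empty, and DIR is not added twice.
append_path() {
    current=$1
    new_dir=$2
    case ":$current:" in
        *":$new_dir:"*) printf '%s\n' "$current" ;;
        "::")           printf '%s\n' "$new_dir" ;;
        *)              printf '%s\n' "$current:$new_dir" ;;
    esac
}

LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" /usr/local/lib)
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

The libtool message in the install logs lists alternatives to this variable, such as adding the directory to /etc/ld.so.conf.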

root@f456604ccbed:~# mkdir -p data/tessdata
root@f456604ccbed:~# wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata?raw=true
root@f456604ccbed:~# pwd
/root
root@f456604ccbed:~# mv eng.traineddata\?raw\=true /root/data/tessdata/eng.traineddata
root@f456604ccbed:~# export TESSDATA_PREFIX=/root/data
root@f456604ccbed:~# tesseract --list-langs
List of available languages (1):
eng
root@f456604ccbed:~#

The traineddata is in place and recognized.
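The layout that matters here is that TESSDATA_PREFIX points at the parent of the tessdata directory, as the error message said. A minimal sketch of a check for that layout, assuming the default prefix /usr/local/share taken from the error message above; the helper name `has_traineddata` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if PREFIX/tessdata/LANG.traineddata exists,
# which is the layout the log above sets up.
has_traineddata() {
    td_prefix=$1
    td_lang=$2
    [ -f "$td_prefix/tessdata/$td_lang.traineddata" ]
}

# Fall back to the path from the error message when the variable is unset.
tess_prefix=${TESSDATA_PREFIX:-/usr/local/share}
if has_traineddata "$tess_prefix" eng; then
    echo "eng.traineddata found under $tess_prefix/tessdata"
else
    echo "eng.traineddata missing under $tess_prefix/tessdata"
fi
```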

root@f456604ccbed:~# cd
root@f456604ccbed:~# wget https://cloud.githubusercontent.com/assets/931554/20041852/bda107d4-a46f-11e6-8c49-6d022007e445.jpg -O sample.jpg
root@f456604ccbed:~# tesseract sample.jpg stdout
Error in pixReadMemJpeg: function not present
Error in pixReadMem: jpeg: no pix returned
Error during processing.
root@f456604ccbed:~#

Hmm.

stackoverflow.com

It looks like libjpeg had to be installed before Leptonica. At this point I kind of want to start over in a fresh container.
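One way to catch this before building Tesseract is to look at what Leptonica's configure detected. A minimal sketch, assuming that configure writes its results to config_auto.h in the Leptonica source root and defines HAVE_LIBJPEG as 1 when libjpeg is found; both names are assumptions to verify against the version in use, and the helper name `defines_one` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if HEADER contains "#define MACRO 1".
defines_one() {
    cfg_header=$1
    cfg_macro=$2
    grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+$cfg_macro[[:space:]]+1([[:space:]]|\$)" \
        "$cfg_header" 2>/dev/null
}

# Assumption: path of the generated config header for the tree in the log.
lept_config=/root/workspace/leptonica-1.73/config_auto.h
if defines_one "$lept_config" HAVE_LIBJPEG; then
    echo "Leptonica was configured with JPEG support"
else
    echo "JPEG support not detected (or $lept_config not found)"
fi
```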

root@f456604ccbed:~# cd /root/workspace/
root@f456604ccbed:~/workspace# wget https://github.com/LuaDist/libjpeg/archive/8.4.0.tar.gz
root@f456604ccbed:~/workspace# tar -zxvf 8.4.0.tar.gz
root@f456604ccbed:~/workspace# cd libjpeg-8.4.0
root@f456604ccbed:~/workspace/libjpeg-8.4.0# ./configure
root@f456604ccbed:~/workspace/libjpeg-8.4.0# make
root@f456604ccbed:~/workspace/libjpeg-8.4.0# make install
# (snip)
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR`
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH` environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH` environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR` linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf`

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

Then build Leptonica once more.

root@f456604ccbed:~# cd /root/workspace/leptonica-1.73
root@f456604ccbed:~/workspace/leptonica-1.73# ./configure
root@f456604ccbed:~/workspace/leptonica-1.73# make
root@f456604ccbed:~/workspace/leptonica-1.73# make install

How about now?

root@f456604ccbed:~/workspace/leptonica-1.73# cd
root@f456604ccbed:~# tesseract sample.jpg stdout
Error in pixGenHalftoneMask: pix too small: w = 173, h = 64
otiai’lO / gosseract

root@f456604ccbed:~#

[Image: f:id:otiai10:20161107062656j:plain]

This became:

[Image: f:id:otiai10:20161107062807p:plain]

Nice!

With this, I've confirmed an environment where Tesseract-OCR works using only make/make install, without the OS package manager.
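Putting the log back together in the order that worked in the end (libjpeg, then Leptonica, then Tesseract) gives roughly the following. This is a dry-run sketch, not a tested installer: it assumes the archives are already unpacked under /root/workspace as in the log, and the helper names `run` and `build_in` are hypothetical. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the build order from this post.
# Set DRY_RUN=0 to execute for real; the default only prints each command.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@" || exit 1
    fi
}

# configure / make / make install inside one source directory.
build_in() {
    run cd "$1"
    run ./configure
    run make
    run make install
}

WORKSPACE=/root/workspace

# Build tools that autogen.sh turned out to need.
run apt-get install -y autotools-dev automake build-essential libtool

# libjpeg must be in place before Leptonica is configured.
build_in "$WORKSPACE/libjpeg-8.4.0"
build_in "$WORKSPACE/leptonica-1.73"

# Tesseract last, pointing configure at the Leptonica headers as in the log.
run cd "$WORKSPACE/tesseract-3.04.01"
run ./autogen.sh
LIBLEPT_HEADERSDIR=$WORKSPACE/leptonica-1.73/src
export LIBLEPT_HEADERSDIR
run ./configure
run make
run make install
```

Keeping the default at dry-run makes it easy to review the sequence in a fresh container before committing to a full build.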

Filed as a DRY note.