Papermerge
May 16, 2020
Contents
1 Requirements
    1.1 Python
    1.2 Imagemagick
    1.3 Poppler
    1.4 Tesseract
    1.5 Database
2 Installation
    2.1 OS Specific Packages
        2.1.1 1. Web App + Workers Machine
            2.1.1.1 Ubuntu Bionic 18.04 (LTS)
        2.1.2 2. Web App Machine
            2.1.2.1 Ubuntu Bionic 18.04 (LTS)
        2.1.3 3. Worker Machine
    2.2 Manual Way
        2.2.1 Package Dependencies
        2.2.2 Web App
        2.2.3 Worker
        2.2.4 Recurring Commands
    2.3 Systemd
        2.3.1 Package Dependencies
        2.3.2 Web App
    2.4 Docker
    2.5 Ansible (Semiautomated)
    2.6 Jenkins + Ansible (Fully Automated Deployment)
3 Languages Support
4 REST API
    4.1 How It Works?
        4.1.1 Get a Token
        4.1.2 Use the Token
    4.2 REST API Reference
5 Page Management
    5.1 Delete Page(s)
    5.2 Reorder Pages
    5.3 Cut & Paste
6 Settings
    6.1 STORAGE_ROOT
    6.2 S3
    6.3 OCR
    6.4 DATABASES
    6.5 STATICFILES_DIRS
7 Developers Guide
    7.1 Contributing
        7.1.1 Fix a Typo
        7.1.2 Open an Issue
        7.1.3 Add Your Language Support
    7.2 Design
        7.2.1 1. Frontend
        7.2.2 2. Backend
        7.2.3 3. Workers
    7.3 Branching Model
        7.3.1 Worker, Papermerge-js Branching Model?
        7.3.2 Git Branching/Tagging Blitz Introduction
    7.4 Language Support
        7.4.1 What is Language Support?
        7.4.2 User Interface Language
        7.4.3 Document Content Language
8 Indices and tables
Papermerge
I have nothing against paper. Paper is a brilliant invention of humanity. But in the 21st century I find it more appropriate for paper-based documents to be digitized (scanned). Once scanned, appropriate software can be used to find any document in a fraction of a second, just by typing a few keywords.
Papermerge is a document management system designed to work with scanned documents. As well as OCR with full text search, it provides the look and feel of major modern file browsers, with a hierarchical structure for files and folders, so that you can organize your documents in a similar way to Dropbox (via web) or Google Drive.
CHAPTER 1
Requirements
Papermerge depends on the following software:
• Python >= 3.8.0
• Tesseract - for OCR
• Imagemagick - for image operations
• Poppler - for PDF operations
• PostgreSQL >= 11.0 - for Full Text Search
1.1 Python
Papermerge is a Python 3 application.
1.2 Imagemagick
Papermerge uses Imagemagick to convert between image formats.
1.3 Poppler
More exactly, the poppler utils are used. For example, the pdfinfo command line utility is used to find out the number of pages in a PDF document.
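A minimal sketch of how a page count can be read from pdfinfo output. The function names are made up for illustration and are not taken from the Papermerge code base; the sketch only assumes that pdfinfo prints a line of the form "Pages: <number>".

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def get_page_count(pdf_path: str) -> int:
    """Run `pdfinfo <pdf_path>` (requires poppler-utils) and parse its output."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_count(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the parsing step easy to test without a PDF file at hand.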
1.4 Tesseract
If you have never heard of Tesseract - it is Google’s open source Optical Character Recognition software. It extracts text from images. It works fantastically well for a wide range of languages.
1.5 Database
One of Papermerge’s core philosophies is “Find Any Document”. The PostgreSQL database comes with Full Text Search (FTS) support out of the box. Papermerge uses the websearch_to_tsquery PostgreSQL function, which was introduced in PostgreSQL version 11.0.
With full text search you can search documents the same way people search web pages with a search engine (Google, Bing, Yandex, DuckDuckGo): you just type some words, and the search result displays only documents containing those words, sorted by relevance.
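To make the idea concrete, here is a rough sketch of a search query built around websearch_to_tsquery. The table name core_page is a placeholder chosen for illustration, and the mapping of the text_eng and text_deu columns (mentioned in the Developers Guide) to the 'english' and 'german' search configurations is an assumption; the real query used by Papermerge may look different.

```python
from typing import Tuple

# Hypothetical table name, used only for illustration.
PAGE_TABLE = "core_page"

# tsvector columns and the text search configuration assumed for each.
TSVECTOR_COLUMNS = {"text_eng": "english", "text_deu": "german"}


def build_search_query(user_input: str, column: str = "text_eng") -> Tuple[str, tuple]:
    """Return (sql, params) for a ranked full text search.

    The user's words are passed as a query parameter (%s placeholder, as used
    by psycopg2) and are never interpolated into the SQL string.
    """
    if column not in TSVECTOR_COLUMNS:
        raise ValueError(f"unknown tsvector column: {column}")
    config = TSVECTOR_COLUMNS[column]
    sql = (
        f"SELECT id, ts_rank({column}, query) AS rank "
        f"FROM {PAGE_TABLE}, websearch_to_tsquery('{config}', %s) AS query "
        f"WHERE {column} @@ query "
        "ORDER BY rank DESC"
    )
    return sql, (user_input,)
```

The pair returned by build_search_query("invoice berlin") can be handed to a database cursor's execute method; ts_rank provides the relevance ordering.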
CHAPTER 2
Installation
There are different methods to install Papermerge. They differ in purpose and in the amount of effort required.
2.1 OS Specific Packages
This section explains how to install operating system specific packages. There are three cases:
1. Both web app and workers are on same machine
2. Web app machine
3. Worker machine
2.1.1 1. Web App + Workers Machine
2.1.1.1 Ubuntu Bionic 18.04 (LTS)
Install the required Ubuntu packages:
sudo apt-get update
sudo apt-get install python3 python3-pip python3-venv \
    poppler-utils \
    imagemagick \
    build-essential \
    tesseract-ocr \
    tesseract-ocr-deu \
    tesseract-ocr-eng
Notice that for Tesseract only the English and German (Deutsch) language packages are needed.
Ubuntu Bionic 18.04 comes with the PostgreSQL 10 package. Papermerge, on the other hand, requires at least version 11 of PostgreSQL.
Install Postgres version 11:
# add the repository
sudo tee /etc/apt/sources.list.d/pgdg.list <<END
deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main
END
# get the signing key and import it
wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
sudo apt-key add ACCC4CF8.asc
# fetch the metadata from the new repo
sudo apt-get update
# install PostgreSQL 11 from the new repo
sudo apt-get install postgresql-11
2.1.2 2. Web App Machine
Tesseract does not need to be installed on a machine that runs only the Web App.
2.1.2.1 Ubuntu Bionic 18.04 (LTS)
Install the required Ubuntu packages:
sudo apt-get update
sudo apt-get install python3 python3-pip python3-venv \
    poppler-utils \
    imagemagick \
    build-essential
2.1.3 3. Worker Machine
The worker is the one performing the heavy task of extracting text from images, so it must have the tesseract packages installed.
2.2 Manual Way
Papermerge has two parts:
• Web application
• Worker - which is used for OCR operation
With this installation method both parts will run on the same computer. This installation method is suitable for developers. In this method no configuration is automated, so it is a perfect method if you want to understand the mechanics of the project.
If you follow along in this document and still have trouble, please open an issue on GitHub so I can fill in the gaps.
2.2.1 Package Dependencies
In this setup, Web App and Workers run on the same machine.
Install the OS specific packages for Web App + Workers (section 2.1.1).
Check that Postgres version 11 is up and running:
sudo systemctl status postgresql@11-main.service
Create new role for postgres database:
sudo -u postgres createuser --interactive
When asked “Shall the new role be allowed to create databases?”, please answer yes (when running tests, Django creates a temporary database).
Create new database owned by previously created user:
sudo -u postgres createdb -O <user-created-in-prev-step> <dbname>
Set a password for user:
sudo -u postgres psql
ALTER USER <username> WITH PASSWORD '<password>';
2.2.2 Web App
Once we have prepared the database, tesseract and other dependencies, let’s start with Papermerge itself.
Clone main papermerge project:
git clone https://github.com/ciur/papermerge papermerge-proj
Clone papermerge-js project (this is the frontend part):
git clone https://github.com/ciur/papermerge-js
Create python’s virtual environment .venv:
cd papermerge-proj
python3 -m venv .venv
Activate python’s virtual environment:
source .venv/bin/activate
Install required python packages (now you are in papermerge-proj directory):
# while in <papermerge-proj> folder
pip install -r requirements.txt
Rename file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore.
Adjust the following settings in config/settings/development.py:
• DATABASES - name, username and password of the database you created in PostgreSQL
• STATICFILES_DIRS - include the path to <absolute_path_to_papermerge_js_clone>/static
• MEDIA_ROOT - absolute path to the media folder
• STORAGE_ROOT - absolute path to the same media root, but with a “local:/” prefix
Note:
1. Make sure that data_folder_in and data_folder_out point to the same location.
2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.
Then, as in any Django based project, run migrations, create a super user and run the built-in web server:
cd <papermerge-proj>
./manage.py migrate
./manage.py createsuperuser
./manage.py runserver
At this point, you should be able to see the (styled) login page, and it should look like the screenshot below. You should also be able to log in with the administrative user you created before with the ./manage.py createsuperuser command.
Also, you can upload some documents and see their previews.
But because there is no worker configured yet, documents are basically plain images. Let’s configure a worker!
2.2.3 Worker
Let’s add a worker on the same machine as the Web Application we configured above. We will use the same python virtual environment as for the Web Application.
Note: Workers are the ones who depend on (and use) tesseract, not the Web App.
Clone repo and install (in same python’s virtual environment as Web App) required packages:
git clone https://github.com/ciur/papermerge-worker
cd papermerge-worker
pip install -r requirements.txt
Create a file <papermerge-worker>/config.py with following configuration:
worker_concurrency = 1
broker_url = "filesystem://"
broker_transport_options = {
    'data_folder_in': '/home/vagrant/papermerge-proj/run/broker/data_in',
    'data_folder_out': '/home/vagrant/papermerge-proj/run/broker/data_in',
}
worker_hijack_root_logger = True
task_default_exchange = 'papermerge'
task_ignore_result = False
result_expires = 86400
result_backend = 'rpc://'
include = 'pmworker.tasks'
accept_content = ['pickle', 'json']
s3_storage = 's3:/<not_used>'
local_storage = "local:/home/vagrant/papermerge-proj/run/media/"
Important: The folder pointed to by data_folder_in and data_folder_out must exist and must be the same one as in the configuration for the Web Application.
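The requirement above can be checked before starting the worker. The helper below is an illustrative sketch, not part of papermerge-worker; it only assumes a dictionary shaped like broker_transport_options in the configuration file.

```python
import os


def check_broker_folders(transport_options: dict) -> str:
    """Verify that both broker folders are the same existing directory.

    Returns the absolute path of the shared folder, or raises an exception
    describing which requirement is violated.
    """
    folder_in = transport_options["data_folder_in"]
    folder_out = transport_options["data_folder_out"]
    if os.path.abspath(folder_in) != os.path.abspath(folder_out):
        raise ValueError(
            "data_folder_in and data_folder_out must point to the same folder"
        )
    if not os.path.isdir(folder_in):
        raise FileNotFoundError(f"broker folder does not exist: {folder_in}")
    return os.path.abspath(folder_in)
```

Calling check_broker_folders(broker_transport_options) at the end of config.py would make a misconfigured worker fail at startup instead of silently never receiving tasks.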
Now, while in <papermerge-worker> folder, run command:
CELERY_CONFIG_MODULE=config celery worker -A pmworker.celery -Q papermerge -l info
At this stage, keep both the built-in web server (./manage.py runserver command above) and the worker running in the foreground and upload a couple of PDF documents. Give the worker a few minutes to OCR each document. After that the document becomes more than an image - you can now select text in it!
Fig. 1: Now you should be able to select text
2.2.4 Recurring Commands
At this point, if you try to search for a document, nothing will show up in the search results. This is because workers OCR a document and place the results into a .txt file, so the extracted text is not yet in the database.
A special Papermerge command, txt2db, reads the .txt files and inserts their content into the database entries of the associated documents (more precisely, the documents’ pages).
Afterwards another command, update_fts, prepares a special database column with the correct information about the document (more precisely - the page).
Run commands manually:
cd <papermerge-proj>
./manage.py txt2db
./manage.py update_fts
Note: In the manual setup (i.e. without any of Papermerge’s background services running), if you want a document to be available for search, you need to run the ./manage.py txt2db and ./manage.py update_fts commands every time after a document is OCRed.
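If typing both commands after every upload gets tedious, they can be wrapped in a small script. This is only a sketch; the helper names are made up, and it assumes the script is run with the interpreter of the virtual environment where Papermerge is installed.

```python
import subprocess
import sys
from typing import List

# The two management commands from this section, in the order they must run.
RECURRING_COMMANDS = ("txt2db", "update_fts")


def build_recurring_commands(manage_py: str = "./manage.py") -> List[List[str]]:
    """Return one argument list per management command."""
    return [[sys.executable, manage_py, name] for name in RECURRING_COMMANDS]


def run_recurring_commands(project_dir: str) -> None:
    """Run txt2db, then update_fts, inside the <papermerge-proj> folder."""
    for command in build_recurring_commands():
        # check=True stops the sequence if a command fails
        subprocess.run(command, cwd=project_dir, check=True)
```

The order matters: update_fts works on text that txt2db has already moved into the database.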
2.3 Systemd
In this installation method you use a special papermerge command, startetc, to generate a bunch of configuration files in the <papermerge-proj>/run/etc folder. Then, with one single command:
systemctl --user start papermerge
you start a full-fledged staging environment with nginx, gunicorn, one worker and recurring commands running as services on a single machine. I really love this method and I use it in my local development environment. This method relies on systemd and its --user argument.
2.3.1 Package Dependencies
You will need to install the OS specific packages for Web App + Workers first (section 2.1.1). Then make sure that PostgreSQL is up and running.
Make sure that your machine has both nginx and systemd available:
nginx -V
systemd --version
2.3.2 Web App
Clone main papermerge project:
git clone https://github.com/ciur/papermerge papermerge-proj
Clone papermerge-js project (this is the frontend part):
git clone https://github.com/ciur/papermerge-js
Create python’s virtual environment .venv:
cd papermerge-proj
python3 -m venv .venv
Activate python’s virtual environment:
source .venv/bin/activate
Install required python packages (now you are in papermerge-proj directory):
# while in <papermerge-proj> folder
pip install -r requirements.txt
Rename file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore.
Adjust the following settings in config/settings/development.py:
• DATABASES - name, username and password of the database you created in PostgreSQL
• MEDIA_ROOT - absolute path to the media folder
• STORAGE_ROOT - absolute path to the same media root, but with a “local:/” prefix
Note:
1. Make sure that data_folder_in and data_folder_out point to the same location.
2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.
Then, as in any django based project, run migrations and create super user:
cd <papermerge-proj>
./manage.py migrate
./manage.py createsuperuser
Run startetc command:
./manage.py startetc
Just out of curiosity, have a look at the <papermerge-proj>/run folder generated by the startetc command. It should have the following structure:
run
    broker
        data_in
        data_out
        data_processed
    etc
        gunicorn.conf.py
        nginx.conf
        papermerge.env
        pmworker.env
        pmworker.py
        systemd
            papermerge.service
            papermerge.target
            pm_nginx.service
            pmworker.service
            txt2db.service
            txt2db.timer
            update_fts.service
            update_fts.timer
    log
    tmp
Systemd can be used to manage user services. For that, the --user flag is used. User services must be referenced in the ~/.config/systemd/user folder. By the way, I made a video about the systemd --user feature.
Create ~/.config/systemd/user if you don’t have it. Then reference (create symbolic links to) the <papermerge-proj>/run/etc/systemd/ units in the ~/.config/systemd/user folder:
cd ~/.config/systemd/user
ln -s <papermerge-proj>/run/etc/systemd/* .
Important: Path <papermerge-proj>/run/etc/systemd/* must be absolute.
Start papermerge:
systemctl --user start papermerge.target
2.4 Docker
With this method you will need git, docker and docker-compose installed.
1. Install Docker
2. Install docker-compose
3. Clone Papermerge Repository:
git clone https://github.com/ciur/papermerge papermerge-proj
4. Run docker compose command (which will pull images from DockerHub):
cd papermerge-proj/docker
docker-compose up -d
This will pull and start the necessary containers. If you wish, you can use the docker-compose -f docker-compose-dev.yml up --build -d command instead to build local images.
Check if services are up and running:
docker-compose ps
Papermerge Web Service is available at http://localhost:8000. For initial sign in use:
URL: http://localhost:8000
username: admin
password: admin
You can check logs of each service with:
docker-compose logs worker
docker-compose logs app
docker-compose logs db
2.5 Ansible (Semiautomated)
Coming soon. . .
2.6 Jenkins + Ansible (Fully Automated Deployment)
To be added. . .
CHAPTER 3
Languages Support
Theoretically all languages supported by tesseract (over 130) can be used.
But for my own needs only two were required:
• German
• English
Thus, only support for these two languages is provided. Both localization (of the user interface) and OCRing of documents in German and English are basically hardcoded into the project.
CHAPTER 4
REST API
Screencast demo
The REST API is a way to interact with Papermerge far beyond the Web Browser realm. It gives you the power to extend Papermerge in many interesting ways. For example, it allows you to write a simple bash script to automate uploading of files from a specific location on your local (or remote) computer.
Another practical scenario where the REST API can be used is to automatically (well, you need some sort of 3rd party script for that) import attached documents from a given email account.
4.1 How It Works?
Instead of the usual sign in with username and password via Web Browser, you sign in with a token (a fancy name for a sequence of numbers and letters) from practically any software which supports the HTTP protocol.
Thus, working with the REST API is a two step process:
1. get a token
2. use the token from 3rd party REST API client
4.1.1 Get a Token
1. Click User Menu (top right corner) -> API Tokens
2. Click New Token
3. Decide on the number of hours the token will be valid. The default is 4464 hours (186 days), which is roughly equivalent to 6 months. Click the Save button.
4. After you click the Save button, two information messages will be displayed. Write down your token from the Remember the token: . . . info window.
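The default validity is easy to verify with a little arithmetic: 4464 hours / 24 = 186 days, which is six months of 31 days each. A small sketch (the function name is illustrative, not a Papermerge API):

```python
from datetime import datetime, timedelta

DEFAULT_TOKEN_HOURS = 4464  # default value offered in the New Token form


def token_expiry(created_at: datetime, hours: int = DEFAULT_TOKEN_HOURS) -> datetime:
    """Return the moment a token created at `created_at` stops being valid."""
    return created_at + timedelta(hours=hours)


# 4464 hours is exactly 186 days
assert timedelta(hours=DEFAULT_TOKEN_HOURS) == timedelta(days=186)
```

For example, a token created on 16 May 2020 with the default validity would expire on 18 November 2020.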
Fig. 1: “API Tokens” in User Menu (step 1)
Fig. 2: “New token” button (step 2)
Important: Write down your token. For security reasons, it will be displayed only once. In the picture below, it is the one marked in red.
Important: Tokens are saved in the database encrypted. A token’s encrypted version is called its digest. In the tokens table (by the way, you can have as many tokens as you like) the first column displays the first 16 characters of the digest. It is a way to identify the token. In the picture below, the token’s digest is marked in green.
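The following sketch illustrates the idea of storing a digest and showing only its first 16 characters. The text above does not say which algorithm Papermerge uses; SHA-512 is assumed here purely for illustration, and the function names are made up.

```python
import hashlib

DIGEST_PREFIX_LENGTH = 16  # number of digest characters shown in the tokens table


def token_digest(token: str) -> str:
    """Return a hex digest of the token (SHA-512 is an assumption)."""
    return hashlib.sha512(token.encode("utf-8")).hexdigest()


def digest_prefix(token: str) -> str:
    """Return the short identifier displayed in the first table column."""
    return token_digest(token)[:DIGEST_PREFIX_LENGTH]
```

Because only the digest is stored, the original token cannot be shown again later, which is why it has to be written down when it is created.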
4.1.2 Use the Token
Once you have your REST API token, you can use Papermerge with any HTTP client. Just remember to include the REST API token as a header in the following format:
Authorization: Token <your token here>
Let’s see some examples with curl. The simplest REST API call is:
curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
    <HOST>/api/documents
If you get a 2XX response, it means your Authorization header and token are correct.
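The same call can be made from Python with nothing but the standard library. This is a sketch: the helper names are made up, and the host value is whatever you would substitute for <HOST> in the curl command.

```python
import json
import urllib.request
from typing import Optional


def build_request(host: str, path: str, token: str,
                  method: str = "GET",
                  data: Optional[bytes] = None) -> urllib.request.Request:
    """Create a request carrying the 'Authorization: Token ...' header."""
    request = urllib.request.Request(f"{host}{path}", data=data, method=method)
    request.add_header("Authorization", f"Token {token}")
    return request


def list_documents(host: str, token: str):
    """GET <host>/api/documents and decode the JSON response."""
    request = build_request(host, "/api/documents", token)
    # urlopen raises urllib.error.HTTPError for 4XX and 5XX responses
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For a local Docker setup the call would look like list_documents("http://localhost:8000", "<your token here>").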
Upload local file to remote host specified with <HOST>:
curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
    -T /home/eugen/documents/demo/2019/berlin1.pdf \
    <HOST>/api/document/upload/berlin_x1.pdf
Notice that the local file name is berlin1.pdf while it appears in the URL as berlin_x1.pdf. This way the file can be renamed during upload.
You can upload files without specifying their remote name. In that case the remote file will have the same name as the local file:
curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
    -T /home/eugen/documents/demo/2019/berlin1.pdf \
    <HOST>/api/document/upload/
Note: Notice the trailing / character. When uploading a file with curl without specifying a file name, the URL must end with /. This is a way to tell curl that we don’t want to rename the file.
Your (REST API) uploaded files will end up in Inbox.
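An upload can be sketched in Python as well. The helper name is made up; the sketch mirrors what curl -T does, i.e. it sends the file with the PUT method and, when no remote name is given, appends the local file name to the URL.

```python
import os
import urllib.parse
import urllib.request
from typing import Optional


def build_upload_request(host: str, token: str, local_path: str,
                         remote_name: Optional[str] = None) -> urllib.request.Request:
    """Create a PUT request for <host>/api/document/upload/<filename>."""
    # Like curl with a trailing slash: fall back to the local file name
    name = remote_name or os.path.basename(local_path)
    url = f"{host}/api/document/upload/{urllib.parse.quote(name)}"
    with open(local_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Authorization", f"Token {token}")
    return request
```

Passing the returned request to urllib.request.urlopen performs the actual upload.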
4.2 REST API Reference
REST API authorization header:
• name: Authorization
• value format: Token <your-token-here>
Example:
Fig. 3: In red color is your (example) token (step 4)
Fig. 4: Files uploaded with REST API end up in Inbox.
curl ... -H "Authorization: Token <your-token-here>"
REST API URLs:
URL                              HTTP Method  Description
/api/documents                   GET          json list of all documents
/api/document/<id>               GET          json info about document with id=<id>
/api/document/upload/            PUT          Uploads unnamed file (random name will be assigned)
/api/document/upload/<filename>  PUT          Uploads named file
CHAPTER 5
Page Management
Screencast demo
Page management is a new set of Papermerge features for managing pages. In other words, you can delete, reorder, cut and paste pages.
Scanning documents in bulk often results in documents with blank pages; some pages may be out of order or may be part of a totally different document. Even if the user notices these flaws immediately, it is time consuming and frustrating to redo the scanning process. Thus it is a welcome feature of Papermerge to allow the user to fix out of order pages in the application.
5.1 Delete Page(s)
Delete those blank pages. Although my scanner has an automatic “remove blank pages” feature, it misses some blank pages. So I find it very practical to allow the user to remove blank pages by himself/herself.
5.2 Reorder Pages
Out of order pages occur very often during the scanning process. Papermerge allows users to change the order of pages within the document.
5.3 Cut & Paste
You can move pages from one document to another. Once you cut one or several pages from a document, you can either paste them inside another document, in which case the pages become part of that document, or paste them in the file browser, which creates an entirely new document from the cut pages.
CHAPTER 6
Settings
These are configuration settings for Papermerge - Web App. Configuration settings are used in the same manner as for any Django based project.
Settings which are common for all environments (production, development, staging) are defined in the papermerge.config.settings.base module.
If you want to reuse papermerge.config.settings.base, create a python file, for example staging.py, and import all settings from the base module:
from .base import *
DEBUG = False
STATIC_ROOT = '/www/static/'
The example above assumes that staging.py was created in the same folder as base.py. Don’t forget to point the DJANGO_SETTINGS_MODULE environment variable to your settings module.
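One common Django way to point DJANGO_SETTINGS_MODULE at a settings module is to set it from Python before Django starts, as manage.py does. In the sketch below the dotted path config.settings.staging is an assumption (it presumes staging.py sits next to base.py in a package importable as config.settings); adjust it to your layout.

```python
import os

# Hypothetical dotted path to the staging settings module.
STAGING_SETTINGS = "config.settings.staging"


def select_settings(module: str = STAGING_SETTINGS) -> str:
    """Point Django at a settings module unless the environment already does."""
    # setdefault keeps a value exported in the shell, so the shell wins
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", module)
    return os.environ["DJANGO_SETTINGS_MODULE"]
```

Because setdefault is used, exporting DJANGO_SETTINGS_MODULE in the shell still overrides the value chosen in code.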
6.1 STORAGE_ROOT
• local:/<path to local folder>
• s3:/<path to bucket>
Defines either a local or a remote location where documents are stored. In case of local, its meaning is the same as for Django’s MEDIA_ROOT. In case of s3 storage, it indicates the path to the S3 bucket.
Examples:
STORAGE_ROOT = 'local:/home/vagrant/papermerge-proj/run/media'  # good for development env
STORAGE_ROOT = 's3:/yourbucketname/alldocuments'  # suitable for production
Note: If you choose not to use S3 storage, STORAGE_ROOT needs to be set to a local:/... path and the S3 option must be set to False. And the other way around: if you want to use S3 storage, both STORAGE_ROOT and S3 need to be set accordingly (S3=True, STORAGE_ROOT='s3:/bucketname').
6.2 S3
• True|False
Tells Papermerge whether to use S3 storage. S3=True is more suitable for production environments.
Note: In case S3=True you need to point STORAGE_ROOT to an S3 location.
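The rule that STORAGE_ROOT and S3 must agree can be expressed as a small check. This is an illustrative sketch with made-up function names, not code from Papermerge; it assumes the prefix is everything before the first colon and the path is everything after it.

```python
from typing import Tuple


def parse_storage_root(storage_root: str) -> Tuple[str, str]:
    """Split 'local:/some/path' or 's3:/bucket/path' into (scheme, path)."""
    scheme, separator, path = storage_root.partition(":")
    if not separator or scheme not in ("local", "s3") or not path.startswith("/"):
        raise ValueError(f"unsupported STORAGE_ROOT: {storage_root!r}")
    return scheme, path


def check_storage_settings(storage_root: str, s3: bool) -> str:
    """Raise ValueError if STORAGE_ROOT and S3 contradict each other."""
    scheme, _ = parse_storage_root(storage_root)
    if s3 != (scheme == "s3"):
        raise ValueError(f"S3={s3} does not match STORAGE_ROOT scheme {scheme!r}")
    return scheme
```

With the two example values from section 6.1, the local one passes only together with S3=False and the s3 one only together with S3=True.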
6.3 OCR
• True|False
Enables or disables OCR features. With OCR=False no workers need to be configured.
6.4 DATABASES
This is a Django specific configuration setting. Papermerge uses PostgreSQL as its database, which means that the ENGINE option must be set to django.db.backends.postgresql. Example:
DATABASES = {
    'default': {
        'NAME': 'db_name',
        'ENGINE': 'django.db.backends.postgresql',
        'USER': 'db_user',
        'PASSWORD': 'db_password'
    },
}
6.5 STATICFILES_DIRS
Include the absolute path to the papermerge-js static files.
Example:
STATICFILES_DIRS = [
    '/home/vagrant/papermerge-js/static'
]
CHAPTER 7
Developers Guide
Documentation, notes and general info for developers (myself included).
7.1 Contributing
This document describes in detail how you can contribute to the papermerge project.
7.1.1 Fix a Typo
Contribute to the project just by fixing text typos. Like tis one. Yes, English is not my native lnguage and I do lots of typoz.
Fixing documentation typos is the easiest and fastest way to contribute to the project. Even if you correct one minor typing mistake, I will add you to the list of contributors.
7.1.2 Open an Issue
Another way to contribute is to open issues. Obviously this means you need to run the application at least once and test it.
7.1.3 Add Your Language Support
Adding language support is not as trivial as fixing a typo or opening an issue, but it is not that difficult either. In any case, there is a separate page in the developer guide for it.
7.2 Design
A brief description of the architecture of Papermerge and why such design decisions were taken. The Papermerge project has 2 parts:
• Web Application
• Workers
The web application is further divided into Frontend and Backend. As a result there are 3 separate repositories that are part of one whole.
Fig. 1: High level design. Backend and frontend are separate.
7.2.1 1. Frontend
Papermerge-js Repository
Warning: The name papermerge-js is misleading, because it implies that only javascript is used, which is not true. This project manages all static assets: javascript, css, images, fonts.
Modern web applications tend to use a lot of javascript and css. Javascript code, as opposed to code written in Python, becomes increasingly difficult to manage. The same goes for css. To deal with codebase complexity, I decided to split the frontend into a completely separate project. This project is a Webpack project. In practice this makes it a little bit easier to deal with growing javascript code complexity. The outcome of this project is, among others, two important files:
<papermerge-js>/static/js/papermerge.js
<papermerge-js>/static/css/papermerge.css
There are other static files as well, like images and fonts. However, images and fonts are just placed in <papermerge-js>/static and nothing really interesting happens with them.
7.2.2 2. Backend
Papermerge-proj Repository
The backend is a standard Django application. It uses static files from the frontend part. Throughout the documentation it is referred to as backend because the term webapp is more general (webapp = backend + frontend).
7.2.3 3. Workers
Papermerge-worker Repository
Workers perform OCR on the documents. Documents are passed by reference (see note below) from the backend to the workers via a shared location. In the simplest setup, when everything runs on the same machine, the shared location is just a folder on the local machine accessible by both worker and backend. In production, the shared location is an S3 bucket.
Note: There are at least two distinct methods of passing documents from the backend to the workers. The first method is very simple, but wrong: the backend just transfers the entire document byte by byte to the worker. Without diving deep into technical details, this method is not scalable because it depletes the backend’s memory very quickly.
Instead, the backend instructs workers which documents they need to OCR by telling them the document id (it passes the user id and language name as well).
Fig. 2: Backend passes documents to workers by reference.
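Passing by reference means the message sent to a worker is tiny: it names the document instead of containing it. The sketch below shows what such a message could look like; the class and field names are made up for illustration and do not mirror the actual task signature in papermerge-worker.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OcrRequest:
    """Reference to a document stored in the shared location (illustrative)."""
    user_id: int
    document_id: int
    lang: str  # language name, e.g. "deu" or "eng"


def encode(request: OcrRequest) -> str:
    """Serialize the reference; the message size is independent of the file size."""
    return json.dumps(asdict(request))


def decode(payload: str) -> OcrRequest:
    """Rebuild the reference on the worker side."""
    return OcrRequest(**json.loads(payload))
```

The worker then uses these ids to locate the file in the shared location and reads it from there, so the document bytes never travel through the message broker.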
7.3 Branching Model
All current development goes into master branch.
Papermerge versions branch from the master branch and are tagged for a specific version. This is easier to explain with a picture.
Fig. 3: Branching model used by Papermerge project.
• Stable version branches are named stable/1.0.x, stable/1.1.x etc.
• Git tagging is used to mark specific software version e.g. v1.0.0, v1.1.0, v1.2.0 and so on.
7.3.1 Worker, Papermerge-js Branching Model?
Well, both worker and papermerge-js will follow the same model.
Note: I started using the branching model described above somewhere around 14th February 2020 and I have applied it only to the main project - unfortunately at that moment I forgot about the other two parts.
As a temporary workaround I tagged both worker and papermerge-js with v1.1.0 tags to mark their compatibility point in time with the main project.
7.3.2 Git Branching/Tagging Blitz Introduction
To checkout a branch stable/1.1.x, use command:
$ git checkout stable/1.1.x
To checkout a tagged commit, say a commit tagged v1.1.0, you use same command as checking out a branch:
$ git checkout v1.1.0
7.4 Language Support
By default, papermerge is hardcoded to work with documents in only two languages - German and English. Theoretically it can support more than 100 languages. However, I, as developer and user of this software, included in papermerge only what was useful for me (German and English).
You can contribute to this project by adding support (and testing it) for your own language. It is an extremely rewarding experience, because:
• it is fun
• you will learn a lot
• you will create something useful for you and others
7.4.1 What is Language Support?
There are two parts to consider:
• User Interface language (text like username, Log out)
• Document Content (actual content of your documents)
7.4.2 User Interface Language
User Interface language is the text the user sees and interacts with. For example, the label for username in German is Benutzername, and the text for Log out in German is Abmelden. To localize the user interface (UI) in your own language you need to be familiar with the Django way, because the main web application is a Django project.
Contributing to the project in this sense basically means creating/updating the papermerge/<langcode>/LC_MESSAGES/django.po file.
7.4.3 Document Content Language
Every document uploaded to papermerge will be OCRed by the tesseract command line utility. The tesseract command requires the -l <lang> argument to indicate the language of the document. This is the heart of document language support. Have a look at the extract_hocr and extract_txt functions in the worker’s shortcuts module. Both functions build the tesseract command with the language as first argument.
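A rough sketch of what building such a command can look like. This is not the worker's actual extract_hocr or extract_txt code; the function name is made up, and the arguments follow the order from tesseract's own usage text (image, output base, options, config file) rather than the order used by the worker.

```python
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str,
                            hocr: bool = False) -> List[str]:
    """Return the argument list for one tesseract run.

    tesseract writes <output_base>.txt by default and <output_base>.hocr
    when the 'hocr' config file is appended.
    """
    command = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        command.append("hocr")
    return command
```

The resulting list can be passed to subprocess.run; only the value after -l changes when a document is in a different language.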
To check what languages you have installed for tesseract, use command:
$ tesseract --list-langs
In my case, it lists deu and eng - which are codes for German and English languages.
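The same check can be scripted, for example to verify that a language code is installed before offering it in the UI. The helper names below are made up; the sketch assumes the output consists of a header line starting with "List of available languages" followed by one code per line.

```python
import subprocess
from typing import List


def parse_langs(output: str) -> List[str]:
    """Return the language codes printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [
        line for line in lines
        if not line.lower().startswith("list of available languages")
    ]


def installed_langs() -> List[str]:
    """Run tesseract and return its installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some tesseract versions print the list to stderr, so read both streams
    return parse_langs(result.stdout + result.stderr)
```

On a machine like the one described above, installed_langs() would contain deu and eng.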
OCRing of the documents (tesseract -l deu path/to/doc) happens on the worker side. I explain this because it is important to know, but for adding language support you don’t need to change anything in the worker, because the worker only takes orders and blindly executes them.
The entry point for the worker part is the task module with its ocr_page function. Again, there is no need to change anything here; I mention this only because it is important to know.
The first thing you need to look into and change is dynamic_preferences_module, where the configuration for the language is defined.
You will need to add a new choice in the OcrLanguage class. The code for the new language must match the language code listed by tesseract --list-langs. This change will add a new entry in the UI and will allow the user to choose the new language for the document.
But the tricky part is making the change on the database level. The thing is, papermerge makes use of the PostgreSQL full text search feature, which means it needs to store an updated version of a tsvector type column. How to create and search tsvector type columns is described in the postgres documentation.
Every time the page.text column is changed, a database level trigger is fired to update the language specific tsvector column. Triggers for this job are defined in the papermerge/core/pgsql/01_triggers.sql file.
Another important language related sql file is papermerge/core/pgsql/03_update_lang_cols.sql. This sql code is executed periodically by the papermerge/core/management/commands/update_fts.py command. It is responsible for moving document page.text to page.text_deu or page.text_eng.
Both page.text_eng and page.text_deu are tsvector type columns with preset weight ‘C’.
CHAPTER 8
Indices and tables
• genindex
• modindex
• search