textrecognitiondatagenerator-readthedocs-io-en-latest
textrecognitiondatagenerator-readthedocs-io-en-latest
Documentation
Release latest
Edouard Belval
1 Installation 3
1.1 Official package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 From source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Overview 5
2.1 Most useful arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Getting help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Tutorial 9
3.1 Just generating data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Generating Chinese data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Text distorsions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 A more advanced use case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Manipulating margins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Module 15
5 Reference 17
5.1 DataGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2 BackgroundGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.3 ComputerTextGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4 DistorsionGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.5 HandwrittenTextGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.6 StringGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
i
ii
TextRecognitionDataGenerator Documentation, Release latest
Since the name is quite long, all subsequent refrences will be under the acronym TRDG.
If you are new to the project, start with the tutorial section!
Contents 1
TextRecognitionDataGenerator Documentation, Release latest
2 Contents
CHAPTER 1
Installation
If you want to add a new language The easiest way to use the tool is by cloning the official repo.
git clone https://github.com/Belval/TextRecognitionDataGenerator
Then you need to install the dependencies. It is recommended to use a virtual environment for those.
pip3 install -r requirements.txt
If you want to use the handwritten text generation feature, you need to install the -hw dependencies.
pip3 install -r requirements-hw.txt
Once that is done, you can move to the tutorial for tips and tricks on how to use TRDG!
3
TextRecognitionDataGenerator Documentation, Release latest
4 Chapter 1. Installation
CHAPTER 2
Overview
1. -i, --input_file
Use it when the provided dictionaries do not fit your usecase. Each line will become an image, if your -c
parameter is high enough.
2. -c, --count
Self-explanatory parameter, but one you will probably want to change. The default value is 1000.
3. -l, --language
This argument is especially important if you want to generate data using a specific script. It changes the dic-
tionary to be used (-l fr is equivalent to -i dicts/fr.txt), but most importantly it changes the default
fonts to take one that supports the language’s script. Passing a chinese dictionary without changing the language
will cause invalid images to be generated.
4. -t, --thread_count
Another self-explanatory parameter, yet very important as most computers these days ship with a multi-core
CPU. Setting this to -t 8 makes TRDG create 8 processes to generate the data.
5. -f, --format
By default, all generated images will be 32 pixels high (or wide if you use -or 1). Now that might be too
small for you. -f allows you to make bigger images.
As with most CLI tools, TRDG’s help is accessible through the -h argument.
If you need more information on a specific argument, find its definition in the reference. If even that does not do, feel
free to open an issue on the official repository.
5
TextRecognitionDataGenerator Documentation, Release latest
optional arguments:
-h, --help show this help message and exit
--output_dir [OUTPUT_DIR]
The output directory
-i [INPUT_FILE], --input_file [INPUT_FILE]
When set, this argument uses a specified text file as
source for the text
-l [LANGUAGE], --language [LANGUAGE]
The language to use, should be fr (French), en
(English), es (Spanish), de (German), or cn (Chinese).
-c [COUNT], --count [COUNT]
The number of images to be created.
-rs, --random_sequences
Use random sequences as the source text for the
generation. Set '-let','-num','-sym' to use
letters/numbers/symbols. If none specified, using all
three.
-let, --include_letters
Define if random sequences should contain letters.
Only works with -rs
-num, --include_numbers
Define if random sequences should contain numbers.
Only works with -rs
-sym, --include_symbols
Define if random sequences should contain symbols.
Only works with -rs
-w [LENGTH], --length [LENGTH]
Define how many words should be included in each
generated sample. If the text source is Wikipedia,
this is the MINIMUM length
-r, --random Define if the produced string will have variable word
count (with --length being the maximum)
-f [FORMAT], --format [FORMAT]
Define the height of the produced images if
horizontal, else the width
-t [THREAD_COUNT], --thread_count [THREAD_COUNT]
Define the number of thread to use for image
generation
-e [EXTENSION], --extension [EXTENSION]
Define the extension to save the image with
-k [SKEW_ANGLE], --skew_angle [SKEW_ANGLE]
Define skewing angle of the generated text. In
positive degrees
-rk, --random_skew When set, the skew angle will be randomized between
the value set with -k and it's opposite
(continues on next page)
6 Chapter 2. Overview
TextRecognitionDataGenerator Documentation, Release latest
8 Chapter 2. Overview
CHAPTER 3
Tutorial
TextRecognitionDataGenerator comes with an (hopefully) easy to use CLI. The tutorial is actually multiple tutorials,
combined in a single page. Feel free to skip sections that are not relevant to your use case.
Fun fact, you don’t need to use any command line arguments if you want English data generated using multiple fonts.
Indeed, simply running python3 run.py will create 1000 English, single word images in the out/ directory such
as these:
Now maybe 1000 is too many or too few for your usecase. You can add the -c argument to set how many examples
will be generated.
python3 run.py -c 10
As expected, you will find 10 examples in the out/ directory.
9
TextRecognitionDataGenerator Documentation, Release latest
python3 run.py -c 10 -l cn
This will generate 10 samples using the Chinese dictionary that can be found in in dicts/cn.txt:
Since the concept of word in Chinese is a bit trickier, the dictionary is made of single characters (make your own!).
Let’s do this again with -w 5 to get something prettier.
python3 run.py -c 10 -l cn -w 5
Now that looks better, but what’s up with the spacing between the characters? We would rather have no spaces. Add
-sw 0.
python3 run.py -c 10 -l cn -w 5 -sw 0
Asian scripts can be written top to bottom, you might want to add the -or 1 argument to get vertical text.
python3 run.py -c 10 -l cn -w 5 -sw 0 -or 1
10 Chapter 3. Tutorial
TextRecognitionDataGenerator Documentation, Release latest
You can do much and more with TRDG, if you run into a missing feature, simply open an issue.
For those familiar with the process of training a machine learning model, you often have to deal with overfitting,
which is when the model gets too good at predicting the samples in the training data and stops generalizing to unseen
examples. One trick to prevent this is by adding the distorsion to the data.
While TRDG does not dwelve too deeply in augmentations, as many better and more complete libraries already take
care of it, some operations are available for convenience through the -d argument which as 3 possible values:
• 0: None
• 1: Sine wave
• 2: Cosine wave
• 3: Random
python3 run.py -c 5 -w 5 -d 1
python3 run.py -c 5 -w 5 -d 3
Text in the real world is not always black, and most importantly, text in the real world is almost never straight. What
if we want to emulate that?
python3 run.py -c 10 -k 15 -rk -bl 0.5 -rbl -tc '#000000,#888888'
Which can be translated to: generate 10 examples with a skewing angle between -15 and 15 with an added gaussian
blur between 0 and 0.1. Finally, the text color should be picked randomly between black and gray (including all the
colors inbetween).
Sure enough, the output is much more colourful!
The default resolution might be too small to your taste (and I agree). By default the output is 32 pixels high because
it’s the height used by most text recognition papers. Now you can change that with -f 64.
python3 run.py -c 10 -k 15 -rk -bl 0.5 -rbl -tc '#000000,#888888' -f 64
12 Chapter 3. Tutorial
TextRecognitionDataGenerator Documentation, Release latest
TRDG allows you to control margins around the text using two parameters, --margins, --fit. The first one
controls margins, in pretty much the same way the CSS property margin does.
This is the result with no fit and the default (5, 5, 5, 5) margins: python3 run.py -c 1 -i texts/test.
txt
Now we can add --fit to apply a tight crop around the rendered text. This changes the size by removing the added
space for accents: python3 run.py -c 1 -i texts/test.txt --fit
Margins are applied the generated text, so even with 0,0,0,0, if you don’t use --fit you will get an apparence of
margins: python3 run.py -c 1 -i texts/test.txt --margins 0,0,0,0
Now if you add --fit, you get an absolutely no margins: python3 run.py -c 1 -i texts/test.txt
--margins 0,0,0,0 --fit
Margin values are comma separated top,left,bottom,right, so --margins 10,0,10,0 will return verti-
cal margins with tight cropping vertically.
And finally, with all margins: python3 run.py -c 1 -i texts/test.txt --margins 10,10,10,
10 --fit
14 Chapter 3. Tutorial
CHAPTER 4
Module
TRDG is also a module that can be included in your favorite training pipeline. The easiest way to use it, is to import a
generator.
The basic one is GeneratorFromStrings which, as its name indicates, will take a list of strings, and generate an
image and label pair.
If you want to avoid having to maintain dictionaries, you can use GeneratorFromDicts which will use the bun-
dled ones, GeneratorFromRandom which generates random strings, and GeneratorFromWikipedia which
picks random article from Wikipedia as its source for strings.
Here are examples for each of those, respectively:
generator_from_dicts = GeneratorFromDicts()
generator_from_random = GeneratorFromRandom()
generator_from_wikipedia = GeneratorFromWikipedia()
The generators will not raise StopIteration, they will keep generating images until you break out of the loop.
Set a non-negative value for count if that’s an issue
15
TextRecognitionDataGenerator Documentation, Release latest
16 Chapter 4. Module
CHAPTER 5
Reference
Coming soon
5.1 DataGenerator
5.2 BackgroundGenerator
5.3 ComputerTextGenerator
5.4 DistorsionGenerator
5.5 HandwrittenTextGenerator
5.6 StringGenerator
17