
Towards Unconstrained End-to-End Text Spotting

Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, Ying Xiao
Google AI

{qinb,bissacco,mraptis,yasuhisaf,yingxiao}@google.com

Abstract

We propose an end-to-end trainable network that can simultaneously detect and recognize text of arbitrary shape, making substantial progress on the open problem of reading scene text of irregular shape. We formulate arbitrary shape text detection as an instance segmentation problem; an attention model is then used to decode the textual content of each irregularly shaped text region without rectification. To extract useful irregularly shaped text instance features from image scale features, we propose a simple yet effective RoI masking step. Additionally, we show that predictions from an existing multi-step OCR engine can be leveraged as partially labeled training data, which leads to significant improvements in both the detection and recognition accuracy of our model. Our method surpasses the state-of-the-art for end-to-end recognition tasks on the ICDAR15 (straight) benchmark by 4.6%, and on the Total-Text (curved) benchmark by more than 16%.

1. Introduction

Automatically detecting and recognizing text in images can benefit a large number of practical applications, such as autonomous driving, surveillance, or visual search, and can increase the environmental awareness of visually impaired people [44].

Traditional optical character recognition (OCR) pipeline methods usually partition the scene text reading task into two sub-problems: scene text detection and cropped text line recognition. Text detection methods try to spot text instances (words or lines) in the input image, while text recognition models take a cropped text patch and decode its textual content. Since most scene text detection methods are unable to directly predict the correct text reading direction, an additional direction identification step is necessary for successful OCR engines [56].

Despite their long history and great success, the use of multiple models within an OCR pipeline engine has several disadvantages: errors can accumulate in such a cascade, which may lead to a large fraction of garbage predictions.

Figure 1. Our end-to-end model can predict the locations and transcriptions of text with arbitrary shape in a single forward pass.

Furthermore, each model in the pipeline depends on the outputs of the previous step, which makes it hard to jointly maximize the end-to-end performance, and to fine-tune the engine with new data or adapt it to a new domain. Finally, maintaining such a cascaded pipeline with data and model dependencies requires substantial engineering effort.

End-to-end OCR models overcome those disadvantages and thus have recently started gaining traction in the research community [42, 54, 37, 24, 32]. The basic idea behind end-to-end OCR is to have the detector and recognizer share the same CNN feature extractor. During training, the detector and recognizer are jointly optimized; at inference time, the model can predict locations and transcriptions in a single forward pass. While producing superior accuracy on straight text reading benchmarks, these methods struggle to generalize and produce convincing results on more challenging datasets with curved text, which arise naturally and frequently in everyday environments (see Figure 1 for two such examples). Handling arbitrarily shaped text is a crucial open problem in order for OCR to move beyond its traditional straight text applications.

In this paper, we propose a simple and flexible end-to-end OCR model based on a Mask R-CNN detector and a sequence-to-sequence (seq2seq) attention decoder [3]. We make no assumptions on the shape of the text: our model can detect and recognize text of arbitrary shape, not just the limited case of straight lines. The key idea underlying our model is to skip the feature rectification step between the detector and the recognizer, and to directly feed cropped and masked text instance features to the decoder.


We show that our model is able to recognize text in different orientations and even along curved paths. Our model learns where to start decoding, and how to update the attention weights to follow the unrectified text path. Our detector is based on Mask R-CNN: for each text instance, it predicts an axis-aligned rectangular bounding box and the corresponding segmentation mask. Using these, our model works seamlessly on both straight and curved text paths.

Typically the recognizer requires far more data to train than the detector. Unlike the case of multi-step OCR models, where cropped text lines (easier to collect and synthesize) are used to train the recognizer, previous end-to-end models demand fully labeled images as training data. This makes end-to-end training challenging due to the shortage of fully annotated images. Furthermore, by the time the recognizer has converged, the detector is often substantially overfitted. In this work, we solve both issues by adding additional large scale partially labeled data which is automatically labeled by an existing multi-step OCR engine¹ [4]. If an input training sample is partially annotated, only the recognizer branch is trained. We find that this significantly boosts the performance of our model.

Our method surpasses the previous state-of-the-art results by a large margin on both straight and curved OCR benchmarks. On the popular and challenging ICDAR15 (straight) dataset, our model outperforms the previous highest by 4.6% on end-to-end F-score. On the Total-Text (curved) dataset, we significantly increase the state-of-the-art by more than 16%.

In summary, the contributions of this paper are three-fold:

• We propose a flexible and powerful end-to-end OCR model based on Mask R-CNN and an attention decoder. Without bells and whistles, our model achieves state-of-the-art results on both straight and curved OCR benchmarks.

• We identify feature rectification as a key bottleneck in generalizing to irregular shaped text, and introduce a simple technique (RoI masking) that makes rectification unnecessary for the recognizer. This allows the attention decoder to directly operate on arbitrarily shaped text instances.

• To the best of our knowledge, this is the first work to show that end-to-end training can benefit from partially labeled data bootstrapped from an existing multi-step OCR engine.

¹Publicly available via Google Cloud Vision API.

2. Related Work

In this section, we briefly review the existing text detection and recognition methods, and highlight the differences between our method and current end-to-end models. For a more detailed review, the reader is referred to [40].

Scene Text Detection: Over the years, the traditional sliding window based methods [28, 8, 62] and connected-component based methods [5, 26, 46, 45, 13, 47] have been replaced by deep learning inspired methods with a simplified pipeline. These newer methods have absorbed the advances from general object detection [36, 50, 49] and semantic segmentation [39, 7] algorithms, adding well-designed modifications specific to text detection. Modern scene text detection algorithms can directly predict oriented rectangular bounding boxes or tighter quadrilaterals via either single-shot [33, 61, 22, 38] or two-stage models [29, 34, 43, 48].

Recently, detecting curved text in images has become an emerging topic: a new dataset containing curved text was introduced in [11], which provides tight polygon bounding boxes and ground-truth transcriptions. In [12], Dai et al. formulate text detection as an instance segmentation problem, and in [41], the authors proposed representing a text instance as a sequence of ordered, overlapping disks, which is able to cover curved cases. Despite the success in detecting curved text, reading curved text remains an unsolved problem.

Scene Text Recognition: The goal of scene text recognition algorithms is to decode the textual content from cropped text patches. Modern scene text recognition methods can be grouped into two main categories: CTC (Connectionist Temporal Classification [19]) based methods [52, 23, 14] and attention based methods [57, 9, 10, 16, 31, 53]. Most scene text recognition methods from both categories assume the input text is rectified (straight line, read from left to right): the input is first resized to have a constant height, then fed to a fully convolutional network to extract features. To capture long range sequence context, some CTC-based architectures stack an RNN on top of a CNN, while others use stacked convolution layers with a large receptive field. Finally, each feature column predicts a symbol, and duplicated symbols are removed to produce the final prediction. In attention models, an RNN is often used to predict one symbol per step based on the prediction at the previous step, the hidden state, and a weighted sum of the extracted image features (the context). The process stops when the end-of-sequence symbol is predicted, or the maximum number of iterations is reached.

End-to-End OCR: The work of Li et al. [32] is the first successful end-to-end OCR model, but it only supports horizontal text. The multi-oriented end-to-end OCR architectures of Liu et al. [37], He et al. [24] and Sun et al. [54] share a common idea: they feed rectified text region features to the recognizer to enable end-to-end training.


Figure 2. Overall architecture of our end-to-end OCR model.

In [37], the model outputs rotated rectangles in the detection stage and uses a CTC-based recognizer, which cannot generalize to curved cases. In [54], the detector outputs quadrilaterals and an attention-based model is used to decode the textual content. In contrast, our detector produces rectangular bounding boxes and the corresponding instance segmentation masks, which is a more general way to represent text of arbitrary shape. In addition, we remove the feature rectification step, which is designed for straight text, and let the attention decoder operate directly on cropped and masked text instance features. This leads to better flexibility and performance on curved text.

Lyu et al. [42] proposed an end-to-end OCR engine which is based on Mask R-CNN. They adopted a simple recognition-by-detection scheme: in order to recognize the text, all the characters are detected individually. This method is not ideal because much of the sequential information is lost. Furthermore, detecting individual characters can be difficult or even impossible in many cases. And even if all the characters are correctly detected, it is highly unclear how to link them into a correct sequence. In [42], the authors simply group characters from left to right, which precludes correct recognition of text in non-traditional reading directions. On the other hand, by leveraging sequential information, our method is able to correctly recognize text in more challenging situations and non-traditional reading directions.

3. Model Architecture and Training

Figure 2 shows the design of our end-to-end OCR model. The detector part of the model is based on Mask R-CNN, which has been widely used in instance segmentation and other related tasks. For each text region (word or text line), Mask R-CNN can predict an axis-aligned rectangular bounding box and the corresponding instance segmentation mask. For the straight text case, the final detection results are obtained by fitting a min-area rotated rectangle to each segmentation mask, while a general polygon is fitted to each mask for the curved text case. By using Mask R-CNN as the detector, our model works seamlessly with straight and curved text paths.
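For illustration, this mask-to-geometry fitting can be sketched with OpenCV as follows. This is a minimal sketch of one common way to implement it, not the implementation used in this work; the function name and the eps_ratio parameter are our own illustrative choices.

```python
# Illustrative sketch: fit a min-area rotated rectangle (straight text) or a
# general polygon (curved text) to a predicted binary instance mask.
import cv2
import numpy as np

def mask_to_geometry(mask, curved=False, eps_ratio=0.01):
    """mask: binary HxW array for a single text instance."""
    found = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                             cv2.CHAIN_APPROX_SIMPLE)
    contour = max(found[-2], key=cv2.contourArea)    # largest connected region
    if curved:
        # General bounding polygon for curved text.
        eps = eps_ratio * cv2.arcLength(contour, closed=True)
        return cv2.approxPolyDP(contour, eps, closed=True).reshape(-1, 2)
    # Min-area rotated rectangle for straight text.
    rect = cv2.minAreaRect(contour)                  # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)                       # 4x2 corner coordinates
```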

A novel feature of our architecture is that we do not rectify the input to the recognizer. This renders traditional CTC-based decoders unsuitable. Instead, we use a seq2seq model (with attention) as the recognizer. At each step, the decoder makes its prediction based on the output and state from the previous step, as well as a convex combination of the text instance features (the context). In order to extract arbitrary shaped text instance features from image level features, we introduce RoI masking, which multiplies the cropped features with the text instance segmentation masks. This removes neighboring text and background, and ensures that the attention decoder will focus only on the current text instance.

3.1. Feature Extractor

We explore two popular backbone architectures, ResNet-50 [21] and Inception-ResNet [55]; the latter model is far larger, and consequently yields better detection and recognition results. Scene text usually has large variance in scale; in order to capture both large and tiny text, the backbone should provide dense features while maintaining a large receptive field. To achieve this, we follow the suggestions from [25]: both backbones are modified to have an effective output stride of 8. In order to maintain a large receptive field, atrous convolutions are used to compensate for the reduced stride.

For ResNet-50, we modified the conv4_1² layer to have stride 1 and use atrous convolution for all subsequent layers. We extract features from the output of the third stage. The Inception-ResNet is modified in a similar way to have output stride 8, taking the output from the second repeated block (layer PreAuxLogits).

²Our naming convention follows [25].

3.2. Detector

We follow the standard Mask R-CNN implementation. In the first stage, a region proposal network (RPN) is used to propose a number of candidate text regions of interest (RoIs). In the second stage, each RoI is processed by three prediction heads: a class prediction head to decide if it is text or not, a bounding box regression head to predict an axis-aligned rectangular box, and finally a mask prediction head to predict the corresponding instance segmentation mask.

The RPN anchors span four scales (64, 128, 256, 512) and three aspect ratios (0.5, 1.0, 2.0); using more scales and aspect ratios may increase the model's performance at the cost of longer inference time. Non-maximum suppression (NMS) is used to remove highly overlapping proposals, with the intersection-over-union (IoU) threshold set to 0.7. The top 300 proposals are kept. In the second stage, features from each RoI are cropped and resized to 28 × 28, followed by a 2 × 2 max pooling, which leads to 14 × 14 features for each RoI. At training time, RoIs are grouped into mini-batches of size 64 and then fed to the class prediction head and bounding box refinement head. A second NMS is performed on top of the refined boxes (IoU threshold 0.7). At inference time, the top 100 regions are sent to the mask prediction head. The final detection output is obtained after a final NMS step, which computes the IoU based on masks instead of bounding boxes as in the first two NMS steps.
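These detector hyperparameters can be summarized in a single configuration sketch. The key names below are illustrative and not taken from the authors' code; the values are those stated in the paragraph above.

```python
# Hypothetical configuration summarizing the detector settings described above.
DETECTOR_CONFIG = {
    "rpn_anchor_scales": (64, 128, 256, 512),
    "rpn_aspect_ratios": (0.5, 1.0, 2.0),
    "rpn_nms_iou_threshold": 0.7,
    "rpn_top_n_proposals": 300,
    "roi_crop_size": 28,                 # RoI features cropped/resized to 28x28
    "roi_pool_size": 2,                  # 2x2 max pool -> 14x14 features per RoI
    "rcnn_batch_size_per_image": 64,     # RoI mini-batch size at training time
    "second_stage_nms_iou_threshold": 0.7,
    "max_regions_to_mask_head": 100,     # at inference time
    "final_nms_iou_on": "mask",          # final NMS uses mask IoU, not box IoU
}
```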

3.3. Multi-Scale Feature Fusion and RoI Masking

In our experiments, we found that stride 8 features and multi-scale anchors are sufficient for the text detection task for both large and small text. However, for text recognition, a finer-grained task, denser features are needed. Inspired by the feature pyramid network [35], we gradually upsample lower resolution, but context rich, features and fuse them with higher resolution features from earlier CNN layers using element-wise addition. A 1 × 1 convolution (with 128 channels) is applied to all features to reduce dimensionality and to ensure uniform shapes before element-wise addition. This produces a dense feature map which encodes both local features and longer range contextual information, which can improve recognition performance, especially for small text. In practice, we find that fusing features with stride 8 and 4 leads to the best results. More specifically, for ResNet-50, we use features after the first (stride 4), second (stride 8) and third stage (stride 8); the corresponding receptive field sizes are 35, 99 and 291 respectively. For Inception-ResNet, we use features after layer Conv2d_4a_3x3 (stride 4), Mixed_5b (stride 8) and PreAuxLogits (stride 8); the corresponding receptive field sizes are 23, 63 and 2335 respectively.
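A minimal sketch of this fusion step is given below, assuming three input feature maps (one at stride 4, two at stride 8) and Keras-style layers; the function name and exact layer choices are assumptions, not the authors' implementation.

```python
# Illustrative multi-scale feature fusion: 1x1 convolutions project every map to
# 128 channels, coarser maps are upsampled, and all maps are added element-wise.
import tensorflow as tf

def fuse_features(stride4, stride8_a, stride8_b, channels=128):
    proj = lambda f: tf.keras.layers.Conv2D(channels, 1, padding="same")(f)
    p4, p8a, p8b = proj(stride4), proj(stride8_a), proj(stride8_b)
    # Bring the stride-8 maps to stride-4 resolution before adding.
    up = tf.keras.layers.UpSampling2D(size=2, interpolation="bilinear")
    return p4 + up(p8a) + up(p8b)   # dense fused map consumed by the recognizer
```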


Figure 3. Our seq2seq based recognizer.

In multi-step OCR engines, each text instance is cropped out from the input image before being fed to the recognizer. In contrast, in end-to-end models, instead of cropping out the image patch, a more involved method is used to extract text instance features from the image level features output by the backbone CNN. For object detection models, axis-aligned bounding boxes are used to crop out features [17]. For text, the work in [37] and [54] proposed RoI rotate and perspective RoI transforms to compute rectified text instance features using rotated rectangles or quadrilaterals. This works well for straight text but fails in the curved text case. In this work, we propose a simple and more general way to extract text instance features that works for any shape, called RoI masking: first the predicted axis-aligned rectangular bounding boxes are used to crop out features, and then we multiply by the corresponding instance segmentation mask. Since we do not know the reading direction of the text at this point, features from each region are resized so that the shorter dimension is equal to 14 while maintaining the overall aspect ratio. RoI masking filters out the neighboring text and background, ensuring that the attention decoder will not accidentally focus on areas outside the current decoding region. Our ablation experiments in Section 4.3 show that RoI masking substantially improves the recognizer's performance.
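RoI masking can be sketched as follows, assuming a single instance and integer box coordinates in feature-map space; this is an illustrative version, not the implementation used in this work.

```python
# Illustrative RoI masking: crop features with the axis-aligned box, zero out
# everything outside the instance mask, then resize so the shorter side is 14.
import tensorflow as tf

def roi_mask(features, box, instance_mask, short_side=14):
    """features: [H, W, C] fused feature map; box: (y0, x0, y1, x1) integers in
    feature-map coordinates; instance_mask: [H, W] binary segmentation mask."""
    y0, x0, y1, x1 = box
    crop = features[y0:y1, x0:x1, :]
    mask = tf.cast(instance_mask[y0:y1, x0:x1], crop.dtype)[..., None]
    crop = crop * mask                      # suppress neighboring text/background
    h, w = float(y1 - y0), float(x1 - x0)
    scale = short_side / min(h, w)          # shorter side -> 14, keep aspect ratio
    new_size = [int(round(h * scale)), int(round(w * scale))]
    return tf.image.resize(crop[None, ...], new_size)[0]
```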

3.4. Recognizer

The recognizer is a seq2seq model with Bahdanau-style attention as proposed in [3], shown in Figure 3. At the first step, the model takes a START symbol and a zero initial LSTM state; we then produce symbols until the end-of-sequence (EOS) symbol is predicted. At each step, the final predicted distribution over possible symbols is given by:

p(y_i | y_1, ..., y_{i-1}, h) = softmax(W_o o_i + b_o)    (1)

where y_i is the predicted character, o_i is the LSTM output at time step i, and h represents the flattened extracted text instance features. At each step, the LSTM takes the prediction of the previous step y_{i-1}, the previous hidden state s_{i-1}, and a weighted sum of the image features c_i (the context) to compute the output o_i and the new state vector s_i:

Page 5: Towards Unconstrained End-to-End Text Spotting · image scale features, we propose a simple yet effective RoI masking step. Additionally, we show that predictions from an existing

(o_i, s_i) = LSTM(y_{i-1}, s_{i-1}, c_i)    (2)

At each step, the decoder is able to pay attention to a specific image region and use the corresponding image features to help make the right prediction. The context vector c_i is a weighted sum of the flattened image features h and the learned weight vector α_i: c_i = Σ_j α_{ij} h_j. The weight vector α_i is defined as:

α_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik})    (3)

e_{ij} = V^T tanh(W_s s_{i-1} + W_h h_j)    (4)

The attention weight for each feature position is determined by the image features (h) and the previous LSTM state (s_{i-1}), which encodes the shift of the attention mask. This enables the recognizer to follow arbitrarily shaped text lines.

By feeding the predicted symbol to the next step, the model can learn an implicit language model. At inference time, the predicted symbol is fed to the next step, while the ground-truth symbol is used during training (i.e., teacher forcing).
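To make Eqs. (1)-(4) concrete, one decoding step can be written as below. The class and parameter names, and the choice to concatenate the previous symbol with the context as the LSTM input, are assumptions for illustration and are not taken from the authors' implementation.

```python
# Illustrative single decoding step of the attention recognizer.
import tensorflow as tf

class AttentionStep(tf.keras.layers.Layer):
    """h: [B, N, D] flattened instance features; y_prev: [B, V] one-hot previous
    symbol; s_prev: list of two [B, hidden] LSTM state tensors."""
    def __init__(self, hidden=256, vocab=79, attn_units=256):
        super().__init__()
        self.cell = tf.keras.layers.LSTMCell(hidden)
        self.W_s = tf.keras.layers.Dense(attn_units, use_bias=False)
        self.W_h = tf.keras.layers.Dense(attn_units, use_bias=False)
        self.v = tf.keras.layers.Dense(1, use_bias=False)
        self.out = tf.keras.layers.Dense(vocab)            # W_o, b_o of Eq. (1)

    def call(self, y_prev, s_prev, h):
        # Eqs. (3)-(4): additive attention over the flattened features.
        e = self.v(tf.tanh(self.W_s(s_prev[0])[:, None, :] + self.W_h(h)))
        alpha = tf.nn.softmax(e, axis=1)                   # [B, N, 1]
        c = tf.reduce_sum(alpha * h, axis=1)               # context vector c_i
        # Eq. (2): the LSTM consumes the previous symbol, state and context.
        o, s = self.cell(tf.concat([y_prev, c], axis=-1), s_prev)
        return tf.nn.softmax(self.out(o)), s, alpha        # Eq. (1)
```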

3.5. Joint Training and Loss Function

We observe that the recognizer requires far more data and training iterations in comparison to the detector; this makes joint training difficult, as the existing public datasets are not large enough to train a high performance attention decoder, especially when the input features are not rectified. Furthermore, if we train long enough to achieve convergence in the recognizer, there is a strong risk of overfitting the detector. In this work, we solve both issues by adding additional large scale partially labeled data which is automatically labeled by an existing multi-stage OCR engine from the Google Cloud Vision API. If the input training sample is fully labeled, we update the weights of both the detector and recognizer. If it has been automatically annotated by an OCR engine (and thus may have unlabeled text), only the recognizer branch is trained. Thus the total multitask loss is defined as:

L = δ(L_rpn + α L_rcnn + β L_mask) + γ L_recog    (5)

Here, δ is 1 if the input is fully labeled and 0 otherwise. In our implementation, α, β and γ are all set to 1.0. Adding machine labeled, partially labeled data ensures that the recognizer "sees" enough text while preventing the detector from overfitting. For the machine labeled data, since detection of all the text is not required, we can increase the confidence threshold to filter out noisy low confidence outputs.
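Eq. (5) amounts to gating the detector losses on whether the sample is fully labeled; a direct sketch, assuming the individual loss terms are computed elsewhere (variable names are illustrative):

```python
# Gated multitask loss of Eq. (5).
def total_loss(l_rpn, l_rcnn, l_mask, l_recog, fully_labeled,
               alpha=1.0, beta=1.0, gamma=1.0):
    delta = 1.0 if fully_labeled else 0.0   # detector losses only on full labels
    return delta * (l_rpn + alpha * l_rcnn + beta * l_mask) + gamma * l_recog
```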

The detection losses are the same as in the original Mask R-CNN paper [20]. The recognizer loss L_recog is the cross entropy loss with label smoothing set to 0.9, as suggested by [57]. During training, the ground-truth boxes and masks are used for RoI cropping and RoI masking, while the predicted ones are used at inference time. We also tried using the predicted bounding boxes and masks during training but found no improvement.

3.6. Implementation Details

The data used to train our model contains images from the training portion of popular public datasets, including SynthText, ICDAR15, COCO-Text, ICDAR-MLT and Total-Text. The numbers of images we used from each dataset are 200k, 1k, 17k, 7k and 1255 respectively. Besides public datasets, we also collected 30k images from the web and manually labeled each word, providing oriented rectangular bounding boxes and transcriptions. The number of fully labeled real images is too low to train a robust end-to-end OCR model. To solve this issue, as mentioned in Section 3.5, we run an existing OCR engine on one million images with text and use the predictions (oriented rectangles and transcriptions) as the partially labeled ground-truth. Our experiments (see Section 4.3) show this can significantly improve the end-to-end performance. To prevent the large volumes of synthetic and partially labeled data from dominating the training data, extensive data augmentations are applied to the fully labeled real images. First, the shorter dimension of the input image is resized to between 480 and 800 pixels, then random rotation, random cropping and aspect ratio jittering are used.

We adopt a single-step training strategy. The backbone CNN is pre-trained on ImageNet; the detector and recognizer are jointly optimized using both fully and partially annotated images. Our ablation experiment (see Section 4.3) shows that this achieves better accuracy than a two-step training strategy, where we first use all the fully labeled data to train the detector and then jointly fine-tune the detector and recognizer using both fully and partially labeled data. We train our model with asynchronous SGD with momentum of 0.9. The initial learning rate depends on the backbone network: 10^-3 for Inception-ResNet and 3 × 10^-4 for ResNet-50. We reduce the learning rate by a factor of 3 every 2M iterations, with a total number of 8M iterations. During training, each GPU takes a single training sample per iteration, and 15 Tesla V100 GPUs are used. We implement the model using TensorFlow [1]; the training process takes about three days to finish. In the recognizer, we use a single layer LSTM with 256 hidden units. Recurrent dropout [15] and layer normalization [2] are used to reduce overfitting. The total number of symbols is 79, which includes digits, upper and lower case English characters, as well as several special characters.
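For illustration, the learning-rate schedule described above can be written as a simple step decay; whether exactly this form was used is an assumption.

```python
# Step-decay learning-rate schedule: divide by 3 every 2M iterations.
def learning_rate(step, backbone="resnet50"):
    base = 3e-4 if backbone == "resnet50" else 1e-3   # Inception-ResNet: 1e-3
    return base / (3 ** (step // 2_000_000))
```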


Detection results:

Method                    P      R      F
SSTD [22]                 80.23  73.86  76.91
EAST [61]                 83.27  78.33  80.72
TextSnake [41]            84.9   80.4   82.6
RRD MS [34]               88     80     83.8
Mask TextSpotter [42]     91.6   81.0   86.0
TextNet [54]              89.42  85.41  87.37
He et al. [24]            87     86     87
FOTS [37]                 91.0   85.17  87.99
FOTS MS [37]              91.85  87.92  89.84
Ours (ResNet-50)          89.36  85.75  87.52
Ours (Inception-ResNet)   91.67  87.96  89.78

End-to-end results:

Method                          S      W      G
Stradvision [30]                43.70  -      -
TextProposals+DictNet [18, 27]  56.0   52.3   49.7
HUST MCLAB [51, 52]             67.86  -      -
E2E-MLT [6]                     -      -      55.1
Mask TextSpotter [42]           79.3   73.0   62.4
TextNet [54]                    78.66  74.90  60.45
He et al. [24]                  82     77     63
FOTS [37]                       81.09  75.90  60.80
FOTS MS [37]                    83.55  79.11  65.33
Ours (ResNet-50)                83.38  79.94  67.98
Ours (Inception-ResNet)         85.51  81.91  69.94

Table 1. Comparison on ICDAR15. "MS" represents multi-scale testing. "P", "R" and "F" stand for precision, recall, and F-score respectively. In the end-to-end evaluation, F-scores under three lexicon settings are shown. "S" (strong) means 100 words, including the ground-truth, are given for each image. For "W" (weak), a lexicon that includes all the words appearing in the test set is provided. For "G", a generic lexicon with 90k words is given, which is not used by our model.

4. Experiments

We evaluate the performance of our model on the ICDAR15 benchmark [30] (straight text) and the recently introduced Total-Text [11] (curved text) dataset.

4.1. Straight Text

We show the superior performance of our model on detecting and recognizing oriented straight text using the ICDAR15 benchmark introduced in Challenge 4 of the ICDAR 2015 Robust Reading Competition. The dataset consists of 1000 training images and 500 testing images. Images in this dataset are captured by wearable cameras, without intentional focus on text regions. There are large variations in text size, orientation, font, and lighting conditions. Motion blur is also common. In this dataset, text instances are labeled at the word level. Quadrilateral bounding boxes and transcriptions are provided. For detection evaluation, a prediction is counted as a true positive if the IoU with the closest ground-truth is larger than 0.5. For end-to-end evaluation, the predicted transcription needs to be identical to the corresponding ground-truth in order to be considered a true positive. Some unreadable words are marked as "do not care". The evaluation metrics of interest are precision (true positive count over detection count), recall (true positive count over ground-truth count), and F-score (harmonic mean of precision and recall).

The results are summarized in Table 1. At inference time, the shorter dimension of the image is resized to 900 pixels. Note that we only use a single scale input. In the detection only task, our method (with the Inception-ResNet backbone) surpasses the best single scale model (FOTS) by 1.8%. For end-to-end performance, our method outperforms the highest single scale model (He et al.) by about 7%. Compared to the multi-scale version of the FOTS model, the current state-of-the-art, our method matches its detection performance while still achieving a 4.6% higher end-to-end F-score.

Method                  P      R      F      E2E (None)
Baseline [11]           40.0   33.0   36.0   -
Textboxes [33]          62.1   45.5   52.5   36.3
TextSnake [41]          82.7   74.5   78.4   -
MSR [60]                85.2   73.0   78.6   -
TextField [59]          81.2   79.9   80.6   -
FTSN [12]               84.7   78.0   81.3   -
Mask TextSpotter [42]   69.0   55.0   61.3   52.9
TextNet [54]            68.21  59.45  63.53  54.0
Ours (ResNet-50)        83.3   83.4   83.3   67.8
Ours (Inc-Res public)   86.8   84.3   85.5   63.9
Ours (Inc-Res)          87.8   85.0   86.4   70.7

Table 2. Results on Total-Text. "P", "R" and "F" are detection precision, recall and F-score; no lexicon is used in the end-to-end evaluation. "Inc-Res" stands for Inception-ResNet. "Inc-Res public" represents our model with the Inception-ResNet backbone, trained using only public datasets.


4.2. Curved Text

The biggest advantage of our method is its outstanding performance on irregular shaped text. We conducted an experiment on the recently introduced curved text dataset called Total-Text [11]. Total-Text contains 1255 images for training and another 300 for testing, with a large number of curved text instances. In each image, text is annotated at the word level; each word is labeled by a bounding polygon. Ground truth transcriptions are provided. The evaluation protocol for detection is based on [58]; the one for end-to-end recognition is based on ICDAR15's end-to-end evaluation protocol (the evaluation script is modified to support general polygons).


Figure 4. Qualitative results of our method on the ICDAR15 (first two columns) and Total-Text (last two columns) datasets. In the bottom right image, prediction errors are shown in blue; some predictions are skipped for better visualization. All the skipped predictions are correctly predicted by our method.

During training, we pre-train the model on straight text and fine-tune it using only images from the training portion of the Total-Text dataset. At inference time, the shorter dimension of each image is resized to 600 pixels. We compare the results of our model with both backbones against previous work in Table 2. We also list results trained using only publicly available datasets for the Inception-ResNet backbone. Our method outperforms the previous state-of-the-art by a large margin in both detection and end-to-end evaluations. Specifically, for detection, our best model surpasses the previous highest by 5.1%. In the end-to-end recognition task, our best model significantly raises the bar by 16.7%. In the absence of our internal fully labeled data and partially machine annotated data, our method still achieves far better performance in both the detection and recognition tasks, at +4.2% and +9.9% respectively.

Several qualitative examples are shown in Figure 4 (third and fourth columns). Our method produces high quality bounding polygons and transcriptions. Surprisingly, we find that our method can also produce reasonable predictions in partially occluded cases (top right image, "ANTIONE") by utilizing visible image features and the learned implicit language model from the LSTM. In the bottom right image, we show some failure cases, where the text is upside down and read from right to left. These cases are very rare in the training data; we believe more aggressive data augmentation may mitigate these issues.

We can visualize the attention weight vector at each step by reshaping it to 2D and projecting back to image coordinates. This provides a great tool to analyze and debug the model's performance. In Figure 5, we find that the seq2seq model focuses on the right area when decoding each symbol and is able to follow the shape of the text. In the middle row (last image), we show the attention mask corresponding to the EOS symbol. The attention is spread across both the start and end positions.
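One way such a visualization could be produced (our own assumption about the projection, not the authors' tooling) is to reshape a step's attention weights onto the 2D RoI grid, resize them to the box size, and paste them into an image-sized heatmap:

```python
# Illustrative attention visualization for one decoding step.
import numpy as np
import cv2

def attention_heatmap(alpha, roi_hw, box, image_hw):
    """alpha: [N] attention weights for one step (N = roi_h * roi_w);
    roi_hw: (roi_h, roi_w) attention grid; box: (y0, x0, y1, x1) in image pixels."""
    roi_h, roi_w = roi_hw
    attn = np.asarray(alpha, dtype=np.float32).reshape(roi_h, roi_w)
    y0, x0, y1, x1 = box
    attn = cv2.resize(attn, (x1 - x0, y1 - y0))            # back to box resolution
    heatmap = np.zeros(image_hw, dtype=np.float32)
    heatmap[y0:y1, x0:x1] = attn / (attn.max() + 1e-6)     # normalized for display
    return heatmap
```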

4.3. Ablation Experiments

We conduct a series of ablation experiments to better understand our end-to-end OCR model. In this section, we report the average precision (AP) score, which is often a better evaluation metric than F-score (which is sensitive to a specific threshold). We use the ICDAR15 test set in this section. Table 3 summarizes the experimental results.

Baselines: We build a detection-only baseline (first row in Table 3) and an end-to-end baseline (third row in Table 3). In the detection-only baseline, we train a model with only the detection branch. In the end-to-end baseline, we train a model with both detection and recognition branches, but do not use partially labeled data or RoI masking, and adopt a single-step training strategy (described in Section 3.6). The end-to-end baseline exhibits stronger detection results than the detection-only baseline (with a ResNet-50 backbone, the improvement is 1.6%) despite being trained on exactly the same data. This suggests that training a recognizer improves the feature extractor for the detection task.

Backbones: From Table 3 we find that on the detection task, the more powerful Inception-ResNet backbone consistently outperforms ResNet-50. On the end-to-end task, our model with the ResNet-50 backbone actually achieves better performance when the training data is limited (without large scale partially labeled data). For our full end-to-end model, the Inception-ResNet backbone achieves a marginal improvement on the end-to-end AP score (59.5% vs. 59.0%).

Partially Labeled Data: The use of partially labeled data provides significant improvements in end-to-end performance across all configurations of our model (row 4 vs. row 6, or row 3 vs. row 5). Interestingly, it also improves the detector without training the detection branch directly (Section 3.5). Once again, this suggests that we can improve the feature extractor by receiving training signal through the recognition branch.


Figure 5. Visualization of the attention weights. Some steps are skipped for better visualization.

Strategy      Configuration   PD   Mask   ResNet-50          Inc-Res
                                          AP_Det   AP_E2E    AP_Det   AP_E2E
Two-step      Det-baseline    -    -      85.5     -         88.2     -
              E2E-full        X    X      86.9     55.3      89.1     57.4
Single-step   E2E-baseline    -    -      87.1     52.8      88.2     51.7
              + Mask          -    X      86.7     53.9      88.9     53.1
              + PD            X    -      87.5     55.7      89.9     58.7
              E2E-full        X    X      87.2     59.0      90.8     59.5

Table 3. Results on the ICDAR15 test set under different model configurations and training strategies. AP numbers are reported. "PD", "Mask" and "Inc-Res" stand for partially labeled data, RoI masking and Inception-ResNet respectively. "Det-baseline" refers to the first step (training the detector using fully labeled data) of the "two-step" training process.

RoI Masking: In Table 3, we show the effectiveness of RoI masking (row 3 vs. row 4, or row 5 vs. row 6). Higher end-to-end AP scores are consistently achieved in the presence of RoI masking (e.g. +3.3% AP with the ResNet-50 backbone when using partially labeled data). This demonstrates that the recognizer benefits from RoI masking. The improvement is more significant for a lighter weight backbone with a smaller receptive field. For detection performance, we observe mixed results: a marginal improvement for the Inception-ResNet backbone, and some degradation when using ResNet-50.

Training Strategy: Row 2 and row 6 of Table 3 compare the effect of the single-step and two-step training strategies described in Section 3.6. In single-step training, we jointly optimize the detector and recognizer together using both fully and partially labeled data. In two-step training, we first train the detection-only baseline, and then add a recognizer into joint training. We find that single-step training consistently outperforms two-step training in both detection and end-to-end evaluations. Single-step training is far simpler, and makes it easier to apply automatic hyperparameter tuning and neural architecture search, which we will study in future work.

4.4. Speed

For images from the ICDAR15 dataset (with resolution 1280×720), the end-to-end inference time is 210 ms for the ResNet-50 backbone and 330 ms for the Inception-ResNet backbone (on a Tesla V100 GPU). If we only run the detection branch, the corresponding inference times are 180 ms and 270 ms respectively. Thus, for scene text images, the computational overhead of the recognition branch is quite small. Sharing the same CNN feature extractor makes the end-to-end model more computationally efficient than two-step methods.

5. Conclusion

In this paper, we present an end-to-end trainable network that can simultaneously detect and recognize text of arbitrary shape. The use of Mask R-CNN, an attention decoder and a simple yet effective RoI masking step leads to a flexible and high performance model. We also show that end-to-end training can benefit from partially machine annotated data. On the ICDAR15 and Total-Text benchmarks, our method surpasses previous methods by a large margin while being reasonably efficient.


References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[4] Alessandro Bissacco, Mark Cummins, Yuval Netzer, and Hartmut Neven. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pages 785–792, 2013.
[5] Michal Busta, Lukas Neumann, and Jiri Matas. FASText: Efficient unconstrained scene text detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 1206–1214, 2015.
[6] Michal Busta, Yash Patel, and Jiri Matas. E2E-MLT: An unconstrained end-to-end method for multi-language scene text. arXiv preprint arXiv:1801.09919, 2018.
[7] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
[8] Xiangrong Chen and Alan L. Yuille. Detecting and reading text in natural scenes. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2, pages II–II. IEEE, 2004.
[9] Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. Focusing attention: Towards accurate text recognition in natural images. In Proceedings of the IEEE International Conference on Computer Vision, pages 5076–5084, 2017.
[10] Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. AON: Towards arbitrarily-oriented text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5571–5579, 2018.
[11] Chee Kheng Ch'ng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 935–942. IEEE, 2017.
[12] Yuchen Dai, Zheng Huang, Yuting Gao, Youxuan Xu, Kai Chen, Jie Guo, and Weidong Qiu. Fused text segmentation networks for multi-oriented scene text detection. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 3604–3609. IEEE, 2018.
[13] Boris Epshtein, Eyal Ofek, and Yonatan Wexler. Detecting text in natural scenes with stroke width transform. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2963–2970. IEEE, 2010.
[14] Yasuhisa Fujii, Karel Driesen, Jonathan Baccash, Ash Hurst, and Ashok C. Popat. Sequence-to-label script identification for multilingual OCR. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 161–168. IEEE, 2017.
[15] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
[16] Suman K. Ghosh, Ernest Valveny, and Andrew D. Bagdanov. Visual attention models for scene text recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 943–948. IEEE, 2017.
[17] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[18] Lluís Gómez and Dimosthenis Karatzas. TextProposals: A text-specific selective search algorithm for word spotting in the wild. Pattern Recognition, 70:60–74, 2017.
[19] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[22] Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. Single shot text detector with regional attention. In Proceedings of the IEEE International Conference on Computer Vision, pages 3047–3055, 2017.
[23] Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang. Reading scene text in deep convolutional sequences. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[24] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5020–5029, 2018.
[25] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7310–7311, 2017.
[26] Weilin Huang, Zhe Lin, Jianchao Yang, and Jue Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 1241–1248, 2013.


[27] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
[28] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In European Conference on Computer Vision, pages 512–528. Springer, 2014.
[29] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017.
[30] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.
[31] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2231–2239, 2016.
[32] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5238–5246, 2017.
[33] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. TextBoxes: A fast text detector with a single deep neural network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[34] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-Song Xia, and Xiang Bai. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5909–5918, 2018.
[35] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[36] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[37] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5676–5685, 2018.
[38] Yuliang Liu and Lianwen Jin. Deep matching prior network: Toward tighter multi-oriented text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1962–1969, 2017.
[39] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[40] Shangbang Long, Xin He, and Cong Yao. Scene text detection and recognition: The deep learning era. arXiv preprint arXiv:1811.04256, 2018.
[41] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
[42] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–83, 2018.
[43] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20(11):3111–3122, 2018.
[44] Leo Neat, Ren Peng, Siyang Qin, and Roberto Manduchi. Scene text access: A comparison of mobile OCR modalities for blind users. 2019.
[45] Lukas Neumann and Jiri Matas. Real-time scene text localization and recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3538–3545. IEEE, 2012.
[46] Lukas Neumann and Jiri Matas. Scene text localization and recognition with oriented stroke detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 97–104, 2013.
[47] Siyang Qin and Roberto Manduchi. A fast and robust text spotter. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–8. IEEE, 2016.
[48] Siyang Qin and Roberto Manduchi. Cascaded segmentation-detection networks for word-level text spotting. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1275–1282. IEEE, 2017.
[49] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[50] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[51] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2550–2558, 2017.
[52] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017.
[53] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4168–4176, 2016.


[54] Yipeng Sun, Chengquan Zhang, Zuming Huang, Jiaming Liu, Junyu Han, and Errui Ding. TextNet: Irregular text reading from images with an end-to-end trainable network. arXiv preprint arXiv:1812.09900, 2018.
[55] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[56] Jake Walker, Yasuhisa Fujii, and Ashok C. Popat. A web-based OCR service for documents. In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 2018.
[57] Zbigniew Wojna, Alexander N. Gorban, Dar-Shyang Lee, Kevin Murphy, Qian Yu, Yeqing Li, and Julian Ibarz. Attention-based extraction of structured information from street view imagery. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 844–850. IEEE, 2017.
[58] Christian Wolf and Jean-Michel Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4):280–296, 2006.
[59] Yongchao Xu, Yukang Wang, Wei Zhou, Yongpan Wang, Zhibo Yang, and Xiang Bai. TextField: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 2019.
[60] Chuhui Xue, Shijian Lu, and Wei Zhang. MSR: Multi-scale shape regression for scene text detection. arXiv preprint arXiv:1901.02596, 2019.
[61] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5551–5560, 2017.
[62] Siyu Zhu and Richard Zanibbi. A text detection system for natural scenes with convolutional feature learning and cascaded classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 625–632, 2016.