Research: Using text spotting to detect textual information hidden within images hosted on onion domains
Due to the continuous efforts of law enforcement agencies to monitor illegal activities taking place on the Tor network, darknet marketplace vendors have developed novel means for evading the digital forensic tools used to gather evidence of such activities. Specifically, hiding textual content within images can effectively evade text analysis techniques used to monitor content on onion hidden services.
A recently published paper introduced a text spotting framework developed specifically to detect and identify textual content within images hosted on Tor hidden services. In this article, we take a look at this framework and at how effectively it obtains useful information from onion domains.
The text detection algorithm used:
The chosen text detection algorithm is based on the Connectionist Text Proposal Network (CTPN), which partitions the image into a number of regions and scores each one according to the probability that it contains text. A vertical anchor mechanism is used so that text and non-text regions are judged by their scores against the proposed regions. This design allows an end-to-end trainable model that uses context information within the image to identify ambiguous text. The CTPN is built on a convolutional network, which makes it possible to input images of variable sizes.
It identifies text lines by sliding a window over the convolutional feature maps and extracting a sequence of text proposals, much like a regular Region Proposal Network. With the aid of multiple flexible anchors, a single window can predict objects across a wide range of scales and aspect ratios. The vertical properties of the image are given more weight in order to reduce the search space, with sections grouped along this direction and up to ten anchors used for each proposal region. The sliding window operates on convolutional features produced by the VGG16 model to yield its predictions. Anchor locations are fixed in order to produce the text scores, which are then filtered with a starting threshold > 0.7. The subsequent steps rely on side-refinement approaches, which predict the offset of each anchor and focus on the horizontal properties of specific regions. In addition to being computationally efficient, this technique has produced positive results on popular datasets, including the ICDAR 2013 dataset, with a precision of 0.93, a recall of 0.83 and an F-Measure of 0.88, as well as an F-Measure of 0.61 on the ICDAR 2015 dataset. It has also performed well on the SWT and Multilingual datasets, with F-Measure scores of over 0.66 and 0.82 respectively.
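As a rough illustration of the vertical anchors and score filtering described above, the following Python sketch generates ten anchor heights for one sliding-window position and keeps only the proposals that score above 0.7. The fixed 16-pixel width, the 11-pixel starting height, the 0.7 growth ratio and all names used here (Proposal, anchor_heights, proposals_for_column, keep_text_proposals) are assumptions made for illustration; the article itself only mentions "up to ten anchors" and the 0.7 threshold.

```python
from dataclasses import dataclass

ANCHOR_WIDTH = 16       # assumed fixed proposal width in pixels
NUM_ANCHORS = 10        # "up to ten anchors" per proposal region, per the article
SCORE_THRESHOLD = 0.7   # starting threshold quoted in the article


@dataclass
class Proposal:
    x_left: float    # left edge of the fixed-width column
    y_center: float  # vertical center of the anchor
    height: float    # anchor height in pixels
    score: float     # text / non-text score in [0, 1]


def anchor_heights(smallest=11.0, ratio=0.7, count=NUM_ANCHORS):
    """Return `count` heights, each 1/ratio times the previous one (assumed scheme)."""
    return [smallest / (ratio ** i) for i in range(count)]


def proposals_for_column(column_index, y_center, scores):
    """Build one proposal per anchor height for a single sliding-window position."""
    heights = anchor_heights(count=len(scores))
    x_left = column_index * ANCHOR_WIDTH
    return [Proposal(x_left, y_center, h, s) for h, s in zip(heights, scores)]


def keep_text_proposals(proposals, threshold=SCORE_THRESHOLD):
    """Keep only proposals whose text score is strictly above the threshold."""
    return [p for p in proposals if p.score > threshold]


if __name__ == "__main__":
    candidates = proposals_for_column(3, 40.0, [0.10, 0.95, 0.70, 0.71])
    for proposal in keep_text_proposals(candidates):
        print(proposal)
```

Because every anchor shares the same width, only the vertical position and height vary, which is one way to read the article's remark that the search space is reduced by weighting the vertical direction.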
For text recognition, the authors of the paper proposed an end-to-end trainable neural network that operates on sequences of various lengths and yields a compact yet effective model. A Convolutional Recurrent Neural Network (CRNN) is used because it can learn directly from labeled sequences and requires only height normalization of the regions during the training and testing stages, delivering highly competitive performance with fewer parameters than a standard model. The CRNN comprises three main components: the convolutional layers, the recurrent layers, and a transcription layer. The convolutional layers obtain a feature sequence from the input image, whereas the recurrent layers predict a label distribution for the extracted regions.
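The height normalization mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's preprocessing code: the 32-pixel target height, the nearest-neighbour sampling and the function name normalize_height are all assumptions, and the region is represented as a plain list of pixel rows to keep the example dependency-free.

```python
def normalize_height(region, target_height=32):
    """Resize a region (list of pixel rows) to a fixed height, keeping aspect ratio.

    Uses nearest-neighbour sampling; the 32-pixel target is an assumed value.
    """
    height = len(region)
    width = len(region[0])
    scale = target_height / height
    target_width = max(1, round(width * scale))
    # Map every output row/column back to its nearest source row/column.
    rows = [min(int(r / scale), height - 1) for r in range(target_height)]
    cols = [min(int(c / scale), width - 1) for c in range(target_width)]
    return [[region[r][c] for c in cols] for r in rows]


if __name__ == "__main__":
    # A 16 x 40 dummy region whose pixel value encodes its position.
    region = [[r * 100 + c for c in range(40)] for r in range(16)]
    resized = normalize_height(region)
    print(len(resized), len(resized[0]))
```

Only the height is fixed; the width is left to vary with the aspect ratio, which fits a recognizer that accepts sequences of different lengths.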
Figure (1): Text detection algorithm used on an image of a product listing from the drugs category
Finally, the transcription layer generates the final label sequence by selecting the highest-probability transcription for each region. The convolutional layers are based on the VGG-VeryDeep architecture, with certain modifications made in order to recognize English words. Since deep convolutional and recurrent layers are both relatively hard to train, batch normalization is applied to certain convolutional layers in order to speed up the training process.
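One common way to turn per-frame label distributions into a final label sequence is greedy CTC-style decoding: take the most probable label in each frame, collapse consecutive repeats, and drop the blank symbol. The article does not say which decoding rule the pipeline uses, so the sketch below is an assumption, as are the alphabet, the blank index and the function name greedy_decode.

```python
BLANK = 0  # assumed index of the blank symbol
ALPHABET = "-abcdefghijklmnopqrstuvwxyz0123456789"  # index 0 is a placeholder for blank


def greedy_decode(frame_probs, alphabet=ALPHABET, blank=BLANK):
    """Decode a list of per-frame probability lists into a string.

    Picks the most probable label per frame, merges consecutive duplicates,
    and removes blanks.
    """
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_probs]
    chars = []
    previous = blank
    for index in best:
        if index != blank and index != previous:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)


def one_hot(index, size=len(ALPHABET)):
    """Helper for the demo: a distribution with all mass on one label."""
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    t, o = ALPHABET.index("t"), ALPHABET.index("o")
    frames = [one_hot(i) for i in (t, t, BLANK, o, BLANK, o, o)]
    print(greedy_decode(frames))
```

The blank between the two "o" frames is what lets the decoder keep a genuinely doubled letter while still merging repeated frames of the same character.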
Results of testing the proposed technique:
As no labeled datasets exist for the various illegal activities taking place on the Tor network, a subset of 100 images from the TOIC dataset was used, with 20 images per category. The images were labeled following the conventions of the ICDAR 2017 and COCO-Text datasets, with manual identification of the bounding boxes and transcription of the text inside them, in order to test the performance of the proposed text spotting (TS) pipeline. A total of 1112 text regions were obtained from the 100 images analyzed.
In order to adapt the text detection technique to the problem at hand, three different proposals were explored by modifying the default parameters of the original model. These proposals were referred to as P1, P2 and P3. The first proposal, P1, uses the default parameters based on the VGG16 model. The second proposal, P2, doubles the original values of the anchor scales parameter of the text detection algorithm.
The third proposal, P3, increases the minimum number of top-scoring boxes kept before non-maximum suppression is applied to the image's region proposals, while also doubling the anchor scales again relative to P2. Once the bounding boxes for the detected regions are obtained, text recognition (TR) is applied to these areas with the TR model parameters left at their default values.
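To make the relationship between the three proposals concrete, the sketch below expresses them as configurations derived from one another and shows where the two tuned parameters act: the number of top-scoring boxes kept before non-maximum suppression, and the suppression step itself. The numeric defaults, the factor by which P3 raises the top-N value, the 0.7 overlap threshold and all names (DetectorConfig, double_scales, filter_proposals) are hypothetical placeholders, since the article does not give the actual values, and P3 is read here as using four times the default anchor scales.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DetectorConfig:
    anchor_scales: tuple  # multipliers applied to the base anchor size
    pre_nms_top_n: int    # top-scoring boxes kept before non-maximum suppression


def double_scales(config):
    """Return a copy of the config with every anchor scale doubled."""
    return replace(config, anchor_scales=tuple(2 * s for s in config.anchor_scales))


# Placeholder defaults; the real values are not given in the article.
P1 = DetectorConfig(anchor_scales=(16,), pre_nms_top_n=1000)
P2 = double_scales(P1)
# The size of the top-N increase for P3 is assumed.
P3 = replace(double_scales(P2), pre_nms_top_n=2 * P1.pre_nms_top_n)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(scored_boxes, config, iou_threshold=0.7):
    """Keep the top-N (score, box) pairs, then apply greedy non-maximum suppression."""
    ranked = sorted(scored_boxes, key=lambda sb: sb[0], reverse=True)
    ranked = ranked[:config.pre_nms_top_n]
    kept = []
    for score, box in ranked:
        # Drop a box if it overlaps too much with a higher-scoring kept box.
        if all(iou(box, other) <= iou_threshold for _, other in kept):
            kept.append((score, box))
    return kept


if __name__ == "__main__":
    boxes = [(0.9, (0, 0, 10, 10)), (0.8, (1, 0, 11, 10)), (0.6, (50, 50, 60, 60))]
    for name, cfg in (("P1", P1), ("P2", P2), ("P3", P3)):
        print(name, cfg, len(filter_proposals(boxes, cfg)))
```

Deriving P2 and P3 from P1 rather than writing each configuration out by hand keeps the only differences between the proposals explicit.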
Table (1): Performance of the text recognition technique on the three proposals
Table (1) shows that the best results were obtained by doubling the anchor scales (P2), which outperformed the other two proposals, including the one with the highest minimum threshold. Figure (2) shows the regions identified under all three proposals, illustrating the effect of adjusting the anchor values and the threshold. Out of 1112 regions, P1 detected 510 text boxes, while P2 detected 610 bounding boxes correctly. P3, however, detected only 300 because of the increase in the vertical threshold and in the minimum number of top-scoring boxes, which had a negative effect on the expected result.
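The counts quoted above can be turned into detection rates with a few lines of Python. This is only a back-of-the-envelope check: it assumes that each reported count refers to correctly detected regions out of the 1112 labeled ones, which the article states explicitly only for P2, and the names used are illustrative.

```python
TOTAL_REGIONS = 1112                            # labeled text regions in the 100 images
DETECTED = {"P1": 510, "P2": 610, "P3": 300}    # counts reported in the article


def detection_rate(detected, total=TOTAL_REGIONS):
    """Fraction of labeled regions that were detected."""
    return detected / total


rates = {name: detection_rate(count) for name, count in DETECTED.items()}
best = max(rates, key=rates.get)

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.1%}")
    print("best proposal:", best)
```

Under that assumption the rates come out at roughly 46%, 55% and 27% for P1, P2 and P3 respectively, so even the best proposal misses a little under half of the labeled regions.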
Table (1) and Figure (2) show that the best result is produced by the P2 proposal, underscoring the significance of the vertical threshold for the anchors described earlier. P2 was therefore chosen for the text recognition stage in the rest of the text spotting pipeline.
Figure (2): The three scenarios for the text spotting pipeline test
Although the text detection algorithm copes well with text orientation, its performance declines markedly on images with confusing backgrounds or a higher degree of curvature. Moreover, some text regions that the algorithm often treats as a single unit can be divided into separate sections, which made the text recognition phase harder and complicated the computation of performance measures such as precision and recall, since it had to be decided how to treat these regions. This particular problem most likely stems from the algorithm's vertical approach to text detection. Figure (3) shows that the text detection algorithm yields excellent results on images with partial orientation, yet performs poorly under occlusion, as shown by its omission of the number beneath the gun.
Figure (3): Examples of properly detected text regions, as well as wrong text regions
The images of both the credit card and the dollar bill (Figure 3, top section) contain a significant number of bounding boxes and are a good indication of the algorithm's performance, although some of these bounding boxes can be relatively inaccurate, particularly in the case of the magnetic cards (Figure 3, bottom section). The text recognition phase, by contrast, did not perform as well as the detection phase.
The scaling of the images was shown to affect the results of the text recognition algorithm significantly: the smaller the regions containing the detected text, the lower the recognition accuracy for those regions. OCR appears to produce broadly similar results, outperforming the proposed algorithm only in areas containing artificial text (the credit card examples in Figure 3), which is not the main focus of the study.
The paper introduced a new text spotting technique for detecting textual content embedded in images on Tor hidden services. The technique proved effective at text detection, yet further research is needed to confirm that it remains accurate on larger datasets.