Generic Solving of Text-based captchas A seminar report submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science & Engineering Eighth Semester 2011 Admission

Generic Solving Of Text Based Captcha


Page 1: Generic Solving Of Text Based Captcha

Generic Solving of Text-based captchas

A seminar report submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology

in

Computer Science & Engineering

Eighth Semester 2011 Admission


ABSTRACT

Over the last decade, it has become well established that a captcha's ability to withstand automated solving lies in the difficulty of segmenting the image into individual characters. The standard approach to solving captchas automatically has been a sequential process in which a segmentation algorithm splits the image into segments that contain individual characters, followed by a character recognition step that uses machine learning. While this approach has been effective against particular captcha schemes, its generality is limited by the segmentation step, which is hand-crafted to defeat the distortion at hand.

No general algorithm is known for the character-collapsing anti-segmentation technique used by most prominent real-world captcha schemes. Here a novel approach is formulated that solves captchas in a single step, using machine learning to attack the segmentation and the recognition problems simultaneously. Performing both operations jointly allows the algorithm to exploit information and context that is not available when they are done sequentially. At the same time, it removes the need for any hand-crafted component, making the approach generalize to new captcha schemes where the previous approach cannot.

Many websites use captchas, or Completely Automated Public Turing tests to tell Computers and Humans Apart, to block automated interaction with their sites. For example, Gmail uses captchas to block access by automated spammers, eBay [12] uses captchas to improve its marketplace by blocking bots from flooding the site with scams, and Facebook uses captchas to limit creation of fraudulent profiles used to spam honest users or cheat at games. The most widely used captcha schemes use combinations of distorted characters and obfuscation techniques that humans can recognize but that may be difficult for automated scripts. Captchas are sometimes called reverse Turing tests, because they are intended to allow a computer to determine whether a remote client is human or machine.

Contents

List of Figures

1 Introduction

2 Outline

3 Motivation

4 Approaches and Data Set
4.1 Approach
4.2 Dataset

5 Algorithm
5.1 Algorithm Overview
5.1.1 Cut-Point Detector
5.1.2 Slicer
5.1.3 Scorer
5.1.4 Arbiter
5.2 Reinforcement Learning
5.3 Dealing with Occluding Lines
5.4 Sequential Recognition

6 Areas of Improvement
6.1 Learn the KNN weights
6.2 Improve cut-point elimination
6.3 Additional Occlusion
6.4 Explore deep neural networks

7 Future Works of captcha system

8 Conclusion

Bibliography

List of Figures

4.1 Segmentation then Recognition
4.2 Various Distortion in Negatively Kerned captcha
4.3 15 Best Captchas over which test was conducted
4.4 Captchas over which test was conducted

5.1 Overview of the algorithm's four components
5.2 How Algorithm Works
5.3 Example of the algorithm successfully applied to a Yahoo captcha
5.4 Cut Optimization
5.5 Graph creation in Slicer
5.6 Reinforced Learning
5.7 Occluding Lines
5.8 Sequential Recognition

7.1 Captcha Future


Chapter 1

Introduction

A captcha, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart [8], is a type of challenge-response test used in computing to determine whether or not the user is human. As the expansion of the acronym makes clear, a captcha is a Turing test. A Turing test measures a machine's level of intelligence, to determine whether it reaches that of a human. The captcha was introduced in 2000 by Luis von Ahn, Manuel Blum and Nicholas J. Hopper of Carnegie Mellon University and John Langford of IBM [14]. Captchas are used to determine whether a user is a machine or a human, but this concept is now dated. With the captcha-breaking system described in the reference paper, most captchas can be decoded by the algorithm its authors designed. The text-based captcha is therefore no longer adequate as a Turing test, and a stronger method or captcha has to be designed. This captcha-breaking system can break most of the captchas used in various web applications.

The standard approach to solving a captcha automatically (i.e. by a computational device) is sequential processing. Sequential processing consists of two major functions: segmentation [32] and recognition. The segmentation algorithm splits the image into segments that contain individual characters. The recognition algorithm uses machine learning to recognize a single character. After recognizing all the characters, the machine can output the decoded text of the captcha. It is called sequential processing because recognition is performed only after segmentation. This approach is effective only against a particular set of captcha schemes. For some captcha schemes, sequential processing fails at the segmentation step, because that step is hand-crafted for the particular distortion it targets. No general segmentation algorithm is known for the character-collapsing technique used by real-world captcha schemes. This drawback is why the traditional sequential approach fails.

This report discusses an algorithm that is not sequential but simultaneous: the two major functions, segmentation and recognition, are executed together over the captcha. Performing both operations jointly allows the algorithm to exploit information and context that is not available when they are done sequentially. It also removes the need for hand-crafted components, so the approach generalizes to new captcha schemes where the previous approach cannot.

Many websites use captchas. Gmail uses captchas [7] to block access by spammers, eBay uses captchas to prevent bots from flooding the site with scams, and Facebook uses captchas to limit the automated creation of fake profiles used to spam honest users or cheat at games. The most widely used captcha schemes use combinations of distorted characters and obfuscation techniques that humans can recognize but that may be difficult for automated scripts. Captchas are sometimes called reverse Turing tests, because they are intended to allow a computer to determine whether a remote client is human or machine [15]. The effectiveness and generality of the results suggest that combining segmentation and recognition is the next evolution of automated captcha solving, and that it can supersede the sequential approach used in earlier work. A comparison of the algorithm's accuracy with that of humans suggests that purely text-based captchas [16] may be nearing their end, and provides early steps toward rethinking how reverse Turing tests can be performed securely.

Dept. of Computer Science & Engg. 2 SIMAT, Vavanoor


Chapter 2

Outline

This report presents a new approach to breaking text-based captchas. Breaking a captcha with an algorithm is not in itself a good deed. The main purpose of this report, however, is to help researchers understand the approach used here to break captchas, so that they can design a stronger and more complex reverse Turing test. The algorithm concentrates only on solving text-based captchas.

The first chapter introduces the proposed algorithm for solving text-based captchas, along with a brief account of the inventors of the captcha and the basic information needed to understand it. The third chapter describes the relevance of and motivation behind the study of this algorithm. The fourth chapter, Approaches and Data Set, describes the important approaches to solving a captcha and the ways this algorithm can be implemented; its Dataset section lists the most widely used text-based captcha schemes. The fifth chapter defines and illustrates the algorithm in detail, together with all of its sub-processes. The sixth chapter lists areas in which the algorithm can be improved. The seventh chapter describes future work to improve the reverse Turing test process, including new captcha schemes that could replace the traditional text-based captcha system. The final chapter is the conclusion, which summarizes the entire report, and is followed by the bibliography.


Chapter 3

Motivation

The main motivation of this method is to put forward a new concept for solving captchas. Captchas are believed to be unsolvable by machines, but the authors of the main reference paper have proposed a new algorithm that solves them. The authors state that they are publishing this algorithm to advance the field of reverse Turing tests and for academic research purposes. The algorithm is complex and more costly to reproduce than employing cheap manual labor to solve captchas. Its high accuracy and effectiveness in solving captchas should lead designers and researchers to invent new and more complex captcha systems, so that security can be further enhanced.


Chapter 4

Approaches and Data Set

4.1 Approach

As mentioned in the Introduction, this algorithm applies only to text-based captchas, so the discussion here is limited to text-based captcha systems. A text-based captcha [13] is treated as an image, and since it is an image, various image processing techniques are applied to it. This section discusses the approaches taken in the past to implement automated captcha solving, the approach taken here, and their limitations.

Automated captcha solving consists of two main processes:

• Segmentation - Segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. Here, segmentation means dividing the captcha into its individual characters. This process is the most difficult and complex to design. It also relies on image processing, since the captcha is an image. In practice, no effective algorithm is known that performs segmentation accurately.

• Recognition - Recognition is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. Here, recognition means identifying each distorted individual character (segmented character) with the help of machine learning. Machine learning is used for recognition because machine learning algorithms were found to consistently outperform humans at single-character recognition. The recognition step is what gives the algorithm its intelligence.


Figure 4.1: Segmentation then Recognition.

There are generally two main methods of implementing automated captcha solving:

• Segment then Recognize Method

• Segment and Recognize Together Method

Figure 4.1 shows the steps involved in decoding, or breaking, a text-based captcha scheme. The process of solving a captcha automatically can be divided into five generic steps: pre-processing, segmentation, post-segmentation, recognition, and post-processing. While segmentation, the separation of a sequence of characters into individual characters, and recognition, identifying those characters, are intuitive and generally understood, there are good reasons for considering the additional pre-processing and post-processing steps as part of a standard process. For example, pre-processing can remove background patterns or eliminate other additions to the image that could interfere with segmentation, while post-segmentation steps can clean up the segmentation output by normalizing the size of each image or otherwise performing steps distinct from segmentation. After recognition, post-processing can improve accuracy by, for example, applying spell checking to any captcha that is based on actual words (such as Slashdot).

Based on this generic captcha-solving architecture, various specific algorithms were tested on the captchas of popular websites. From this analysis, a set of techniques was identified that make captchas more difficult to solve automatically. By varying these techniques, a larger set of variants was created that made it possible to study the effect of each of these features in detail and to refine the automated attack methods.

The Segment then Recognize method is the traditional method: segmentation is performed first, and the segmented characters are then passed to a recognition algorithm that uses machine learning to recognize each character. While this approach has been effective against particular captcha schemes, its generality is limited by the segmentation step, which is hand-crafted to defeat the distortion at hand. No general algorithm is known for the character-collapsing anti-segmentation technique used by most prominent real-world captcha schemes. This technique is called negative kerning, and it is a variant of the object occlusion problem.


(a) captcha with no noise and distortion

(b) captcha with cut-through lines

(c) captcha with color and distortion

Figure 4.2: Various Distortion in Negatively Kerned captcha.

Negative kerning (Figure 4.2) is a character-collapsing technique in which the space between characters is removed so that each character overlaps its neighbor. Occluding here means joining or attaching the characters. In addition to occluding characters, negatively kerned captchas add extra noise, distortion and randomization to prevent side-channel attacks. The noise takes the form of added colors and distorted lines cutting through the text, which make the captcha more complex. A side-channel attack recognizes the captcha content by continuously learning each character in the captcha and predicting the result. When noise is added to a negatively kerned captcha, a side-channel attack becomes difficult [19].

Figure 4.2a shows a captcha that uses negative kerning but has no noise [11], no occluding lines and no other distortion. Figure 4.2b shows a captcha in which occluding lines are added on top of negative kerning to create more distortion and confusion. Figure 4.2c shows a simple captcha with distortion in the form of color, occluding lines and dots, but without negative kerning. Negative kerning is considered the most secure method for preventing segmentation because it has successfully withstood years of attacks. Almost all of the most prominently used captcha schemes rely on it. The other method of choice for preventing segmentation, which seems to have fallen out of fashion after a successful wave of attacks, is to use occluding lines. A captcha's ability to withstand automated solving lies in the difficulty of segmenting the image into individual characters rather than in recognizing the characters themselves. Segmentation is therefore the difficult part of automated captcha solving. Until now, two approaches to automated captcha solving have been formulated:

The first type of attack applies a side-channel attack to all types of captcha. In this method the segmentation algorithm divides the captcha into characters, each together with the distortion affecting it. The machine learning part then tries to remove the distortion or predict the character, and generates the output. This approach is not very promising, because the defender can easily protect the captcha by making it difficult to segment, and even if segmentation is carried out the output will not be correct. As a result, this attack cannot be applied to all captchas.

The second type of attack focuses on finding weaknesses in the distortion algorithms of particular captcha schemes. The attacker designs a special-purpose segmentation algorithm based on image processing and morphological segmentation, which is used to remove the distortion. The image processing algorithm is given features for recognizing the twists and turns in the captcha text, and uses them to convert the text into an understandable form. The morphological segmentation fills in missing data based on the relevant information obtained from the image processing part. This attack also fell short of a general solution: it was applicable only to the reCaptcha 2011 scheme. Later, in 2013, a group of researchers examined hollow captchas specifically and were able to solve all of these captcha schemes by extending the segment then recognize approach into a process of nine consecutive steps.

Until now, research in captcha solving has followed the exploit-patch cycle. The exploit-patch method was tried on the 15 best captcha schemes, shown in Figure 4.3. In the exploit-patch cycle the attacker finds a flaw in a particular anti-segmentation technique, and the defender then either patches it, removing the flaw in the anti-segmentation technique, or moves on to a new one. The limitation of the segment then recognize approach has been the attacker's ability to find new flaws. The proposed algorithm overcomes this limitation by segmenting and recognizing the captcha simultaneously, thus removing the need for manually discovered heuristics to segment captchas [1].

Figure 4.3: 15 Best Captchas over which test was conducted.


4.2 Dataset

This section describes the widely used captcha schemes on which the algorithm was tested. The authors evaluated the efficiency and complexity of their algorithm on the dataset shown in Figure 4.4.

Figure 4.4: Captchas over which test was conducted.

Most of these six captcha schemes were found to depend on negative kerning to prevent segmentation. Since 2011, some of those schemes, namely Baidu and ReCaptcha, have evolved. To keep the algorithm evaluation relevant to the state of the art in captcha design, the corpus was extended to include the 2013 versions of Baidu and ReCaptcha. A corpus is a large and structured set of texts used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. Here the language is the captcha. As the figure shows, Baidu and ReCaptcha evolved (the updated version is the 2013 version) in two radically different ways: Baidu decided to use hollow letters, whereas ReCaptcha introduced more aggressive distortions. The algorithm's success rate decreased on the new version of ReCaptcha compared to the previous version. On the other hand, surprisingly, its accuracy increased significantly on the newer version of Baidu.


Chapter 5

Algorithm

This chapter gives an overview of the algorithm and describes its major components. As mentioned earlier, learning for recognition is a very important part of this algorithm, and reinforcement learning is the main reason for its accuracy. The previous chapter described that occluding lines, and not only negative kerning, are used in creating captchas; solving occluding lines in a generic manner is covered as well, since it is a natural extension of the algorithm. Optimizations and trade-offs that can be applied to the algorithm are discussed in the next chapter.

5.1 Algorithm Overview

As mentioned earlier, this algorithm performs segmentation and recognition together. The first step is to find all possible ways to segment the given captcha, that is, all the ways of separating the captcha into individual characters. An unstructured captcha image can be segmented in many different ways. Given the set of candidate segmentations, the algorithm decides which combination is most likely to be the correct one: after analyzing all the possible segmentation paths, it selects the path (set of segments) that maximizes the recognition rate. This contrasts with the segment then recognize approach, where an uninformed segmentation algorithm passes at most a small number of possible segmentations to an independent recognition algorithm; as a result that approach is more time consuming and may or may not produce the correct result even after many attempts.
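The path selection described above can be sketched as a small dynamic program over cut positions. This is only an illustration under stated assumptions, not the authors' implementation: the function name `best_path`, the `scored` dictionary layout, and the choice to rank a path by the product of its segment confidences are all hypothetical.

```python
import math

def best_path(n_cuts, scored):
    """Pick the most confident way to read the image from cut 0 to the last cut.

    scored maps (i, j), with i < j, to (char, confidence), where confidence
    is assumed to lie in (0, 1]. A path is ranked by the product of its
    confidences (an assumption made for this sketch).
    """
    # best[j] = (log-score, decoded text) of the best path from cut 0 to cut j
    best = {0: (0.0, "")}
    for j in range(1, n_cuts):
        for i in range(j):
            if i in best and (i, j) in scored:
                char, conf = scored[(i, j)]
                candidate = (best[i][0] + math.log(conf), best[i][1] + char)
                if j not in best or candidate[0] > best[j][0]:
                    best[j] = candidate
    # None when no chain of segments connects the first and last cut
    return best.get(n_cuts - 1)

# Hypothetical scores for an image with four cut positions
scores = {
    (0, 1): ("A", 0.9), (1, 3): ("B", 0.8),   # reads "AB"
    (0, 3): ("R", 0.3),                        # one wide segment
    (1, 2): ("l", 0.4), (2, 3): ("3", 0.5),   # over-segmented reading
}
result = best_path(4, scores)
```

Because every candidate segmentation is an edge in the same structure, the recognizer's confidence directly decides which cuts are kept, which is the point of solving both problems jointly.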

Figure 5.1: Overview of the algorithm’s four components

As Figure 5.1 shows, the algorithm consists of four main components:

• Cut-Point Detector: Finds all the potential ways to segment the captcha.

• Slicer: Extracts the segments and combines them into a graph.

• Scorer: Performs Optical Character Recognition (OCR) on the segments and assigns a recognition confidence score to each of them.

• Arbiter: Processes the scores and determines the most likely letters.

Figure 5.2: How Algorithm Works

The graph representation in Figure 5.2 is used to find and store, all at once, every possible segmentation that can be derived from the captcha. This is what allows the algorithm to solve the segmentation and recognition problems simultaneously. The working of each component is discussed in detail below.

5.1.1 Cut-Point Detector

The Cut-Point Detector is the first step of the algorithm. As mentioned earlier, it generates the set of candidate segmentation locations. Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.


To segment the captcha, two points (pixels) are needed to construct each cutting line segment. These two points are obtained by examining the second derivative of the curve traced by the bottom pixels of the captcha characters and of the curve traced by the top pixels. This gives two curves: the first is the second derivative of the top pixels of the captcha, and the second is the second derivative of the bottom pixels.

On both curves, the points at which to segment must now be marked. These are the inflection points [31] of the curve. An inflection point is a point on a curve at which the curve changes from being concave (concave downward) to convex (concave upward), or vice versa. This yields a set of inflection points on the first curve and on the second curve. The inflection points on the top curve are marked in red and those on the bottom curve are marked in blue.

Once the points are available, all possible cut lines are generated. Each cut is constructed by connecting two inflection points, one from the top and one from the bottom. Doing this for the entire curve yields a set of cuts called the potential cuts. The potential cuts are marked on the captcha, and the marked captcha is given as input to the Slicer.
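A minimal sketch of how inflection points might be located on the top and bottom outlines, assuming the captcha has already been binarized into a list of rows with 1 for ink and 0 for background. The helper names (`profile`, `second_difference`, `inflection_points`), the discrete second difference used in place of a true derivative, and the handling of empty columns are assumptions for illustration only.

```python
def profile(image, from_top=True):
    """Per-column row index of the first ink pixel, scanning from the top or bottom."""
    height, width = len(image), len(image[0])
    rows = range(height) if from_top else range(height - 1, -1, -1)
    outline = []
    for x in range(width):
        y = next((r for r in rows if image[r][x]), None)
        # Assumption: an empty column falls back to the far edge of the image
        outline.append(y if y is not None else (height - 1 if from_top else 0))
    return outline

def second_difference(curve):
    """Discrete stand-in for the second derivative of the outline."""
    return [curve[i - 1] - 2 * curve[i] + curve[i + 1]
            for i in range(1, len(curve) - 1)]

def inflection_points(curve):
    """Indices where the sign of the second difference flips (concave <-> convex)."""
    points, previous = [], 0
    for k, value in enumerate(second_difference(curve)):
        if value == 0:
            continue
        if previous != 0 and (value > 0) != (previous > 0):
            points.append(k + 1)  # entry k of the difference belongs to curve index k + 1
        previous = value
    return points

tiny = [[0, 1, 0],
        [1, 1, 0],
        [0, 1, 0]]
top_outline = profile(tiny)                      # first ink row per column, from the top
bottom_outline = profile(tiny, from_top=False)   # same, scanning upward from the bottom
```

On real captchas the outlines are noisy, so some smoothing before differencing would probably be needed; that step is left out here.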

Figure 5.3: Example of the algorithm successfully applied to a Yahoo captcha


5.1.2 Slicer

The Slicer takes as input a captcha with the potential cuts marked on it. The slicer applies some heuristics [30] to extract the meaningful potential segments based on the cut points and builds the graph as shown in Figure 5.4. A potential segment is considered meaningful if the two cuts that define its left and right boundaries are sufficiently far apart, yet not too far apart. The formation of the graph (Figure 5.5) is a distinctive process. The entire text-based captcha is divided into a set of windows, each containing two characters at a time. In the figure, the characters A and B are the two characters of the captcha that lie in the window. This window contains the 4 potential cuts found in the previous step.

The content of each region is then examined. Initially the algorithm takes the region between cuts 0 and 1, where 0 and 5 are the borders of the window and are treated as cuts. The content is analyzed, and a weight is assigned to the recognized part along with the possible character. The next region, between cuts 1 and 2, does not give a meaningful segment and is therefore discarded. After all the values are obtained, a graph is built with the cuts as nodes and with edges labeled by the possible character and the assigned weight. Based on this graph the best cut can be found, here cut 3, because the regions from 0 to 3 and from 3 to 5 each give a character with nearly the same weight. In simple terms, a cut or segment is considered a potential one if the distance from the cut to the black pixels is sufficiently far, yet not too far.
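The "sufficiently far apart, yet not too far apart" rule can be sketched as a graph whose nodes are cut positions and whose edges are the segments worth scoring. The function name `build_segment_graph`, the use of horizontal positions only, and the width thresholds are hypothetical; the source does not give concrete values.

```python
def build_segment_graph(cut_xs, min_width, max_width):
    """Adjacency list: node = index of a cut, edge i -> j = a plausible segment."""
    cuts = sorted(cut_xs)
    graph = {i: [] for i in range(len(cuts))}
    for i in range(len(cuts)):
        for j in range(i + 1, len(cuts)):
            width = cuts[j] - cuts[i]
            # Keep only segments that could plausibly hold a single character
            if min_width <= width <= max_width:
                graph[i].append(j)
    return graph

# Hypothetical cut positions (in pixels) and width limits
graph = build_segment_graph([0, 4, 10, 22, 30], min_width=8, max_width=25)
```

Each surviving edge would then be handed to the Scorer, so tightening the width limits directly reduces the amount of recognition work.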

Figure 5.4: Cut Optimization

Figure 5.5: Graph creation in Slicer

Naturally, as the number of potential cuts increases, the computation time also increases. Using this algorithm, slicing a captcha containing 12 characters was found to take 9 hours. The only remedy for this drawback is to decrease the number of cuts in the potential cut set. To optimize the algorithm, a new approach was formulated that prunes (removes) near-duplicate and improbable cuts from the set of potential cut points. First, all cuts with an angle greater than 30 degrees were removed. Then the ratio of white pixels to black pixels was examined to eliminate cut lines that pass through too many black pixels, since they are most likely cutting through the middle of a letter. Finally, the pixel intensities of the left and right boundaries were compared to estimate whether the cut marks a transition between two letters.
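The first two pruning rules can be sketched as follows, assuming a cut is a pair of (x, y) points, one on the top outline and one on the bottom, and the image is a list of rows with 1 for ink. Measuring the angle from the vertical, the 0.2 ink threshold, the sampling scheme and all function names are assumptions; the third rule (comparing boundary intensities) is not covered.

```python
import math
from itertools import product

def candidate_cuts(top_points, bottom_points):
    """Every pairing of a top inflection point with a bottom one."""
    return list(product(top_points, bottom_points))

def cut_angle(cut):
    """Angle of the cut in degrees, measured from the vertical (an assumption)."""
    (x1, y1), (x2, y2) = cut
    return math.degrees(math.atan2(abs(x2 - x1), abs(y2 - y1)))

def ink_fraction(image, cut, samples=50):
    """Share of sampled points along the cut that land on ink pixels."""
    (x1, y1), (x2, y2) = cut
    hits = 0
    for s in range(samples):
        t = s / (samples - 1)
        x = round(x1 + t * (x2 - x1))
        y = round(y1 + t * (y2 - y1))
        hits += image[y][x]
    return hits / samples

def prune(image, cuts, max_angle=30.0, max_ink=0.2):
    """Drop cuts that lean too far or slice through too much ink."""
    return [c for c in cuts
            if cut_angle(c) <= max_angle and ink_fraction(image, c) <= max_ink]

# 5 x 6 test image with a single vertical stroke in column 2
image = [[0, 0, 1, 0, 0, 0] for _ in range(5)]
cuts = candidate_cuts([(0, 0), (2, 0)], [(0, 4), (2, 4), (5, 4)])
kept = prune(image, cuts)
```

Only the vertical cut through the empty column survives here; the others either lean by more than 30 degrees or pass through the stroke.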

5.1.3 Scorer

The scorer assigns a score to each segmented character. It traverses the graph of potential segments, applies OCR to each, and assigns a confidence value. The previous step produces candidate character segments, called the potential segments. The Scorer analyzes these segments and generates a score for each candidate character. The KNN algorithm is then used to determine the class to which each potential segment belongs.

KNN [29] stands for the k Nearest Neighbor algorithm. It is a classification algorithm that assigns an element to one of a set of classes by measuring its distance to labeled examples in the feature space and choosing the class most common among the k nearest ones. Here, however, a modified version of the KNN algorithm is used.

Once the potential segments of the captcha are available, a score is calculated at the pixel level for the corresponding character, and a recognition confidence score is assigned based on the overall confidence value. This recognition confidence score is used as the input to the KNN algorithm, and its value is compared with the values of the surrounding classes. A class contains a set of similar characters: for example, A and 4 belong to the same class, and likewise 0 and O belong to the same class. Elements of the same class have nearly the same value because they look alike. Allocating each captcha segment to a particular class is the main task performed here.

Segments are processed at the pixel level, as this has been demonstrated to be the best approach for text recognition. The KNN algorithm is preferred here for the following reasons: computation at the pixel level, noise resistance and computational speed. The noise resistance arises from using a relatively small k (less than 10) to identify the nearest neighbors. This is essential here because most of the potential segments generated by the slicer are meaningless and belong to the garbage class.

A distance metric is a function that defines a distance between elements of a set. It was realized that the problem with a standard metric is that it assigns an equal weight to each pixel, regardless of its position in the segment or its grayscale value. It turns out that pixels on the edge of segments are less meaningful than pixels in the center, precisely because they are shared between characters that have been collapsed together. We achieved very good results on all captcha schemes by assigning higher weight to pixels nearer the center of the segment, and to darker ones.
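
A weighted KNN of this kind might look like the sketch below. It is a minimal illustration, not the modified KNN used in the report, whose exact weights are not given. The linear falloff from the central column, the `(1 + darkness)` factor, the flat-list segment representation (0 = white, 1 = black) and the function names are all assumptions made for the example.

```python
from collections import Counter


def pixel_weights(width, height):
    """Per-pixel weights that fall off linearly from the central column.

    Edge columns count half as much as the centre (assumed falloff).
    """
    cx = (width - 1) / 2
    weights = []
    for _y in range(height):
        for x in range(width):
            dist = abs(x - cx) / cx if cx > 0 else 0.0  # 0 at centre, 1 at edge
            weights.append(1.0 - 0.5 * dist)
    return weights


def weighted_distance(a, b, weights):
    """Weighted squared distance between two segments of equal size.

    Segments are flat lists of darkness values (0 = white, 1 = black);
    darker pixels get extra weight through the (1 + darkness) factor.
    """
    total = 0.0
    for pa, pb, w in zip(a, b, weights):
        darkness = max(pa, pb)
        total += w * (1.0 + darkness) * (pa - pb) ** 2
    return total


def knn_classify(segment, training, weights, k=3):
    """Return (label, confidence) from a vote among the k nearest samples.

    training is a list of (pixels, label) pairs and may include a
    'garbage' label for meaningless segments.
    """
    nearest = sorted(
        training,
        key=lambda item: weighted_distance(segment, item[0], weights),
    )[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)
```

The fraction of neighbours that agree is used here as a stand-in for the recognition confidence score; a small `k` keeps a handful of garbage samples from outvoting a close match.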

5.1.4 Arbiter

The arbiter is the final component of this algorithm. It takes the recognition confidence score from the scorer, checks it against the trained values of the class members, and generates the solution for that segment. The scorer determines which class a given segment belongs to; the arbiter then goes through the members of that class and compares each one with the segmented character.

For this, an ensemble learning approach is used. Ensemble learning [27] is a technique for combining many weak learners in an attempt to produce a strong learner. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner. Each class contains the set of segmented characters grouped on the basis of their confidence scores. The algorithm examines the input against these members to arrive at the most likely solution.

5.2 Reinforcement Learning

Reinforcement learning [28] is an important part of this algorithm. Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize reward. Reinforcement learning can therefore make the algorithm more accurate over time. It is based on the idea of "understand first, then react", and the same idea is applied in this algorithm.

The traditional way to train a character classifier is to provide a set of labeled captchas which are already segmented, and then let the classifier learn to recognize each character from those segments using the labels. Preparing the labeled, segmented captchas is therefore itself a task that requires intelligence. In this process it is assumed that the classifier is given the correct number of segments, one for each letter in the captcha. Based on these labeled captchas, the algorithm learns the various schemes used in traditional text-based captchas and then generates the solution. In the "segment then recognize" approach, this assumption holds because the segmentation is handled by a vision algorithm that is not part of the classifier itself. In other words, the segmentation algorithm and the recognition algorithm are two independent algorithms.

(a) captcha error recognition (b) captcha's bad segmentation and bad recognition

Figure 5.6: Reinforced Learning.

Here the algorithm also uses a reinforcement learning approach in which a human plays a part. Instead of providing the classifier with labeled examples of valid segments, the algorithm asks the human to explain the segments that have been misclassified, and then learns from the feedback. Training starts with the traditional method: the algorithm first processes a set of labeled captchas. Some captchas in this set fail to decode, and these failed captchas are saved. For the failed captchas, the algorithm asks for human feedback when a segment surrounded by two correctly classified segments is misclassified. In those cases the algorithm needs human expertise, because the misclassification could be due either to improper segmentation or to bad recognition. If the error was due to improper segmentation, the segment is discarded. If the error was due to a recognition error, the segment is added to the classifier training set. When all the cases are reviewed, the algorithm is retrained with the enriched dataset. In practice, even a single round of reinforcement learning is enough to significantly improve accuracy.
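
The feedback round described above can be sketched as two small steps: select the segments worth showing to a human, then apply the human's verdict. This is a hedged illustration only. The function names, the `ask_human` callback and the assumption that predicted and true labels are already aligned one-to-one are all introduced for the example and do not come from the report.

```python
def segments_needing_review(predicted, truth):
    """Return indices of misclassified segments whose two neighbours are correct.

    predicted and truth are equal-length sequences of labels (assumed aligned).
    Only interior segments can be 'surrounded by two' neighbours.
    """
    indices = []
    for i in range(1, len(truth) - 1):
        wrong_here = predicted[i] != truth[i]
        neighbours_ok = (predicted[i - 1] == truth[i - 1]
                         and predicted[i + 1] == truth[i + 1])
        if wrong_here and neighbours_ok:
            indices.append(i)
    return indices


def review_failures(failed_segments, ask_human, training_set):
    """Apply human feedback to misclassified segments.

    failed_segments: list of (pixels, predicted_label) pairs.
    ask_human(pixels, predicted_label) is a hypothetical callback that returns
    the correct label for a recognition error, or None when the segment
    itself was badly cut.
    Returns the number of segments added to training_set.
    """
    added = 0
    for pixels, predicted in failed_segments:
        verdict = ask_human(pixels, predicted)
        if verdict is None:
            continue  # improper segmentation: discard the segment
        training_set.append((pixels, verdict))  # recognition error: learn from it
        added += 1
    return added
```

After `review_failures` has run over all saved failures, the classifier would be retrained on the enlarged `training_set`.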

5.3 Dealing with Occluding Lines

To make captchas more difficult for any algorithm to decode or break, captcha designers use occluding lines. Occluding lines are extraneous lines drawn across the text to disrupt captcha-breaking algorithms. This algorithm was initially unable to handle occluding lines, but the problem was later solved.

The initial attempt was to introduce a new algorithm, based on the soft margin algorithm, that removes the lines independently of the main algorithm.

Later, two simpler methods were formulated. The first was to add a new class to the scorer. Because the scorer relies on the KNN algorithm, which permits discontinuous character classes, a new class could be added that collects many different shapes of line. The second relies on the cut-point detector: as mentioned earlier, it computes the second derivative of the curve and finds its inflection points to locate potential cuts, which makes it well suited to ignoring flat parts.
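
To see why an inflection-point detector naturally ignores flat or straight stretches, consider the following sketch on a one-dimensional profile (for example, the height of the text outline at each column). It is an illustrative assumption about how such a detector might work, not the report's cut-point detector: the discrete second difference, the tolerance value and the function names are chosen for the example.

```python
def second_difference(values):
    """Discrete second derivative: one entry per interior sample."""
    return [values[i - 1] - 2 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]


def inflection_points(values, tolerance=1e-9):
    """Return indices where the curvature of the profile changes sign.

    Samples with (near) zero second difference, i.e. flat or straight
    stretches such as a low-curvature occluding line, never produce a point
    on their own.
    """
    d2 = second_difference(values)
    points = []
    last_sign, last_offset = 0, None
    for offset, value in enumerate(d2):
        if abs(value) <= tolerance:
            continue  # no curvature here: skip
        sign = 1 if value > 0 else -1
        if last_sign != 0 and sign != last_sign:
            middle = (last_offset + offset) // 2
            points.append(middle + 1)  # d2[j] is centred on values[j + 1]
        last_sign, last_offset = sign, offset
    return points
```

For instance, a constant or straight-line profile yields no points at all, while the samples of a cubic curve yield a single point where the curve switches from concave to convex.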

Figure 5.7: Occluding Lines.

5.4 Sequential Recognition

For every algorithm, one of the prime factors that must always be considered is the computational cost. The computational cost reflects how the algorithm performs across its inputs. In the previous section we saw that the computational cost of the algorithm increases with the size of the captcha, because both the number of characters to be segmented and the number of characters to be recognized increase. This is a serious issue for efficiency.

This problem is addressed with a separate variation of the algorithm, designed to make character recognition more efficient. It uses local recognition, a sub-process of the main recognition process. A local decision is made within a window that takes in two letters at a time. Considering two characters at a time yielded significantly better results than looking at one or three characters at a time. Within a window, all possible cuts are examined. Suppose there are 3 cuts in a window of two characters. The first step is to take any one of the 3 cuts and segment the window with it, which yields two separated characters. The pixel-level score of these separated characters is calculated and stored. This process is repeated for every cut present in the window. The maximum of the stored values is then selected as the recognition confidence score. However, this local decision process is prone to errors.
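
The local decision inside one window can be sketched as a simple search over the candidate cuts. This is a minimal illustration under assumptions: `score_segment` is a hypothetical stand-in for the scorer, the window is represented as a list of pixel columns, and adding the scores of the two halves is one plausible way to rate a cut that the report does not specify.

```python
def best_cut_in_window(window, cuts, score_segment):
    """Pick the cut that gives the highest combined score for both halves.

    window: sequence of pixel columns covering roughly two characters.
    cuts: candidate column indices at which the window may be split.
    score_segment: hypothetical callable mapping a list of columns to a
        confidence value (higher is better).
    Returns (best_cut, best_score); best_cut is None if cuts is empty.
    """
    best_cut, best_score = None, float("-inf")
    for cut in cuts:
        left, right = window[:cut], window[cut:]
        score = score_segment(left) + score_segment(right)
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score
```

With 3 cuts in a window, this calls the scorer six times, which is far cheaper than scoring every path through the full graph of potential segments.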

Figure 5.8: Occluding Lines.

The main source of error in this method is that if the characters of the captcha are tightly joined to each other, the window is unlikely to be placed correctly, and the algorithm gives an entirely wrong output. It was also found that some captcha schemes are oriented in one specific direction, either left or right. The local decision process is more effective on left-oriented captchas than on right-oriented ones.

The best solution to this issue is to perform sequential recognition from both directions and then combine the two recognition scores to improve the overall accuracy. This is called the left-right approach. It is done by executing two local decision processes simultaneously: one performs recognition from left to right, and the other performs the same process from right to left. After both processes complete successfully, their results are combined to obtain the best score.
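
One simple way to combine the two passes is to keep, for each character position, whichever guess is more confident. The sketch below illustrates that idea only; the report does not say how the two scores are combined, so the per-position maximum, the requirement that both passes yield the same number of characters, and the function name are assumptions.

```python
def combine_passes(left_to_right, right_to_left):
    """Merge two recognition passes position by position.

    Each argument is a list of (character, confidence) pairs, both given in
    reading order. For every position the more confident guess is kept
    (ties go to the left-to-right pass).
    Returns (text, merged_pairs).
    """
    if len(left_to_right) != len(right_to_left):
        raise ValueError("both passes must yield the same number of characters")
    merged = []
    for (char_l, conf_l), (char_r, conf_r) in zip(left_to_right, right_to_left):
        if conf_l >= conf_r:
            merged.append((char_l, conf_l))
        else:
            merged.append((char_r, conf_r))
    text = "".join(char for char, _ in merged)
    return text, merged
```

A pass that is weak near its far end, where earlier mistakes have accumulated, is thus backed up by the pass that started from that end.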

Chapter 6

Areas of Improvement

The holistic approach [9] of performing segmentation and recognition simultaneously is formulated here for the first time. Although this algorithm produces good results, it is only a first rough implementation. This chapter describes some of the most promising directions for improvement.

6.1 Learn the KNN weights

As discussed earlier, the KNN algorithm is used to classify the characters. The current implementation uses a single manually chosen set of weights for the KNN distance computation, which performed well on the set of captcha schemes provided initially. It is believed that automatically learning those weights for each captcha scheme would improve accuracy, particularly for schemes that use unusual fonts or specific distortions. It is also believed that this can be accomplished in a fully unsupervised manner, similar to the cut-point detector and slicer phases of our algorithm.

6.2 Improve cut-point elimination

Because the computation time is directly related to the number of potential segments, the first optimization was to come up with heuristics to reduce the number of cut points considered by the cut-point detector. This optimization works by pruning near-duplicate and improbable cuts from the set of potential cut points. First, all the cuts that have an angle greater than 30 degrees are removed. Then the ratio of white pixels to black pixels is examined to eliminate cut lines that pass through too many black pixels, since they are most likely cutting through the middle of a letter. Finally, the pixel intensities of the left and right boundaries are compared to estimate whether the cut marks a transition between two letters. Finding a better set of heuristics that are both generic and more precise is an open question.

6.3 Additional Occlusion

As pointed out earlier, Baidu and CNN captcha schemes use occluding lines with low curvature. While results on these captcha schemes are very good and our algorithm properly detects lines, future work should investigate in depth how various types of lines, e.g., sine waves that have a high curvature, impact the recognition rate. It should also consider other types of occlusion, e.g., blobs. To date, we have not found real-world captcha schemes that employ this type of occlusion; perhaps occlusion of this type presents usability challenges that make it impractical for humans.

6.4 Explore deep neural networks

A primary contribution of this work is to demonstrate the effectiveness of performing segmentation and recognition simultaneously. Accordingly, other algorithms that are able to process captchas in a holistic manner were also considered. In particular, with collaborators, experiments were conducted with deep convolutional neural networks, similar to those in [17]. These experiments have confirmed the benefits of a unified approach, and have achieved captcha-solving results that equal or improve upon those presented here. For certain ReCaptcha data sets, these new results show such dramatic improvement in accuracy, while using large-scale training sets, that they suggest that deep neural networks may hold a substantial advantage over humans for solving text-based captchas [18].

Chapter 7

Future Works of captcha system

With the demonstration (through research publications) that character recognition CAPTCHAs are vulnerable to computer vision based attacks, some researchers have proposed alternatives to character recognition, in the form of image recognition CAPTCHAs which require users to identify simple objects in the images presented. The argument is that object recognition is typically considered a more challenging problem than character recognition, due to the limited domain of characters and digits in the English alphabet. This is the reason why a captcha is presented as an image rather than as a set of characters. When captchas were invented, the designers realized that with the passage of time one of two things would happen: either captchas would remain an invaluable way to differentiate humans and computers, or very high quality OCR would become readily available.

The description so far has been based on solving text-based captchas. It was found that the use of text-based captchas is approaching its end, as they are quite simple to decode. By performing segmentation and recognition together, this algorithm was able to decode many of the captchas successfully. It is therefore plausible that in the near future an updated version of this algorithm could approach 100% decoding of text-based captchas. For all these reasons, the need for a new type of reverse Turing test is growing.

The first potential method is simply to find a more difficult problem in computer vision, for example by incorporating video or by requiring the user to perform a higher order cognitive task such as circling or rotating an object. Following the failure of text-based captchas, the new schemes that arrived were audio and video captchas. The audio captcha failed, however, because it could be decoded using the output of a speech-to-text recognizer. The video captcha can also be decoded successfully by extracting the frames of the video and then analyzing each frame to decode the captcha.

Datta et al. published a paper at the ACM Multimedia '05 Conference, named IMAGINATION (IMAge Generation for INternet AuthenticaTION), proposing a systematic approach to image recognition captchas. According to that paper, a set of images is distorted in such a way that state-of-the-art image recognition approaches fail to recognize them. However, humans could solve this captcha only with considerable difficulty [20].

Microsoft developed Animal Species Image Recognition for Restricting Access (ASIRRA), which asks users to distinguish cats from dogs. Microsoft had a beta version of this for websites to use. Microsoft claims: "Asirra is easy for users; it can be solved by humans 99.6% of the time in under 30 seconds. Anecdotally, users seemed to find the experience of using Asirra much more enjoyable than a text-based CAPTCHA." This solution was described in a 2007 paper in the Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS). However, this project was closed in October 2014 and is no longer available. Less than a year after its release, the Asirra captcha (Figure 7.1a) was successfully broken using a classifier trained to recognize image textures.

The MintEye captcha scheme (Figure 7.1b) was a modern captcha scheme in which an image is distorted and the user has to undistort it to restore the original picture [6]. This scheme, which relies on undistorting an image, was broken by a very simple attack based on Sobel operators that only required 23 lines of Python [5]. As a result, this scheme was also rejected.

(a) Asirra captcha (b) MintEye captcha

Figure 7.1: Captcha Future.

Mitra et al. have suggested using emergent images as an alternative way to encode information in video that might be robust against computer vision algorithms. Emergent images are still or moving images in which objects at first appear only with effort and concentration, but once recognized are very easy to see again, even after several months or years. In effect, once a user has recognized the object, he or she remembers it forever [23]. Emergence refers to the unique human ability to aggregate information from seemingly meaningless pieces, and to perceive a whole that is meaningful.

Recently, game-based captchas have been developed [4]. This kind of captcha gives the user a simple game to complete in order to reach a target goal, on the assumption that only humans can solve it. However, implementing this idea has proven to be difficult, as the game captcha schemes of the leading game captcha provider, Are You a Human, have been broken [22].

NuCaptcha is an early fraud detection service which utilizes behavior analytics to provision threat-appropriate, animated video captchas. NuCaptcha is developed and operated by the Canadian firm NuData Security. Static image-based captchas are routinely used to prevent automated sign-ups to websites by using text or images of words disguised so that optical character recognition (OCR) software has trouble reading them [10]. However, in common captcha systems, users often fail to correctly solve the captcha 7% - 25% of the time. NuCaptcha uses animated video technology that it claims makes puzzles easier for humans to solve, but harder for bots and hackers to decipher [24].

Cognitive behavior: Another way to determine whether the user is a human or a machine is based on their respective solving speeds, i.e., the time a human takes to solve a captcha differs from the time a computer takes [25]. Based on this difference in solving time, one can recognize who is a human and who is a machine.

Leveraging reputation: In addition to considering how a reverse Turing test is solved, captcha providers could consider the identity of the solver, for example the IP address, the geographic location, etc. If a good enough proof of identity can be established, providers can use this reputation to adapt the difficulty of the reverse Turing test.

Chapter 8

Conclusion

Here a detailed explanation and study was made of the approach to solve captchas in a single step that uses machine learning to attack the segmentation and the recognition problems simultaneously. Performing both operations jointly allows this algorithm to exploit information and context that is not available when they are done sequentially.

This algorithm was able to solve many prominent real-world captcha schemes that use both negative kerning and occluding lines without any modification to the algorithm. The algorithm was able to achieve a 38.68% recognition rate on Baidu 2011, 55.22% on Baidu 2013, 51.09% on CNN, 51.39% on eBay, 22.67% on ReCaptcha 2011, 22.34% on ReCaptcha 2013, 28.29% on Wikipedia, and 5.33% on Yahoo.

This study of the algorithm shows how reverse Turing tests might be improved going forward. The effectiveness and universality of the results suggest that combining segmentation and recognition is the next evolution of captcha solving, and that it supersedes the sequential approach used in earlier works. With these advances, it seems that purely text-based captchas are likely to have declining utility; significant effort may be needed to rethink the way we perform reverse Turing tests.

Bibliography

[1] Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, John C. Mitchell. "The End is Nigh: Generic Solving of Text-based CAPTCHAs", 2013.

[2] P. Golle. Machine learning attacks against the asirra captcha. In ACM CCS 2008, 2008.

[3] R. Gossweiler, M. Kamvar, and S. Baluja. What's up captcha? A captcha based on image orientation. In World Wide Web, 2009.

[4] Are you human? http://areyouahuman.com/.

[5] Breaking the minteye image captcha in 23 lines of python. Blog post http://www.jwandrews.co.uk/2013/01/breakingthe-minteye-image-captcha-in-23-lines-of-python.

[6] Minteye captcha. Website: http://www.minteye.com/, 2013.

[7] A. S. E. Ahmad, J. Yan, and M. Tayara. The robustness of google captchas. Technical report, Newcastle University, 2011.

[8] E. Athanasopoulos and S. Antonatos. Enhanced captchas: Using animation to tell humans and computers apart. In IFIP International Federation for Information Processing, 2006.

[9] P. Baecher, N. Buscher, M. Fischlin, and B. Milde. Breaking recaptcha: A holistic approach via shape recognition. In Future Challenges in Security and Privacy for Academia and Industry, pages 56-67. Springer, 2011.

[10] E. Bursztein. How we broke the nucaptcha video scheme and what we propose to fix it. Blog post http://elie.im/blog/security/howwe-broke-the-nucaptcha-videoscheme-and-what-we-propose-tofix-it/, February 2012.

[11] E. Bursztein, R. Bauxis, H. Paskov, D. Perito, C. Fabry, and J. C. Mitchell. The failure of noise-based non-continuous audio captchas. In Security and Privacy, 2011.

[12] E. Bursztein and S. Bethard. Decaptcha: breaking 75% of eBay audio CAPTCHAs. In Proceedings of the 3rd USENIX conference on Offensive technologies, page 8. USENIX Association, 2009.

[13] E. Bursztein, M. Martin, and J. Mitchell. Text-based captcha strengths and weaknesses. In Proceedings of the 18th ACM conference on Computer and communications security, CCS '11, pages 125-138, New York, NY, USA, 2011. ACM.

[14] E. Bursztein, A. Moscicki, C. Fabry, S. Bethard, D. Jurafsky, and J. C. Mitchell. Easy does it: More usable captchas. CHI, 2014.

[15] K. Chellapilla, K. Larson, P. Simard, and M. Czerwinski. Computers beat humans at single character recognition in reading based human interaction proofs (hips). In CEAS, 2005.

[16] K. Chellapilla and P. Simard. Using machine learning to break visual human interaction proofs (HIPs). Advances in Neural Information Processing Systems, 17, 2004.

[17] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2011.

[18] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolution neural networks. arXiv preprint arXiv:1312.6082, 2013.

[19] J. Yan and A. El Ahmad. A Low-cost Attack on a Microsoft CAPTCHA. In Proceedings of the 15th ACM conference on Computer and communications security, pages 543-554. ACM, 2008.

[20] "Imagination Paper". Infolab.stanford.edu. Retrieved 2013-09-28.

[21] "Asirra is a human interactive proof that asks users to identify photos of cats and dogs".

[22] Spamtech. Cracking the areyouahuman captcha. http://spamtech.co.uk/software/bots/cracking-the-areyouhumancaptcha/, 2012.

[23] N. J. Mitra, H.-K. Chu, T.-Y. Lee, L. Wolf, H. Yeshurun, and D. Cohen-Or. Emerging images. ACM Transactions on Graphics, 28(5), 2009. To appear.

[24] Y. Xu, G. Reynaga, S. Chiasson, J.-M. Frahm, F. Monrose, and P. van Oorschot. Security and usability challenges of moving-object captchas: Decoding codewords in motion. In Usenix Security, 2012.

[25] C. Cruz-Perez, O. Starostenko, F. Uceda-Ponga, V. Alarcon-Aquino, and L. Reyes-Cabrera. Breaking recaptchas with unpredictable collapse: heuristic character segmentation and recognition. In Pattern Recognition, pages 155-165. Springer, 2012.

[26] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, September 2014.

[27] Opitz, D.; Maclin, R. (1999). "Popular ensemble methods: An empirical study". Journal of Artificial Intelligence Research 2014.

[28] Sutton, Richard S. (1984). Temporal Credit Assignment in Reinforcement Learning (PhD thesis). University of Massachusetts, Amherst, MA.

[29] Altman, N. S. (1992). "An introduction to kernel and nearest-neighbor nonparametric regression". The American Statistician 46, September 2014.

[30] Pearl, Judea (1983). Heuristics: Intelligent Search Strategies for Computer Problem Solving. New York, Addison-Wesley, December 2014.

[31] http://www.encyclopediaofmath.org/index.php/Point of inflection, January, 2015.

[32] Barghout, Lauren, and Lawrence W. Lee. "Perceptual information processing system." Paravue Inc. U.S. Patent Application 10/618,543, filed July 11, 2014.
