Designing an unbeatable CAPTCHA
The CAPTCHA has been around for years, but most designs are still far from perfect. In fact, most can be defeated with some simple image manipulation and a decent OCR algorithm. For those of you who don’t know, a CAPTCHA is a challenge-response system in which the user is presented with a problem that they must solve in order to verify that they are human and not a spam bot. There are many popular designs around (reCAPTCHA, Egglue, etc.), but the important question is: which is the best?
There are 5 main types of CAPTCHA:
- Image – An image containing text is displayed to the user, and the user must provide a response based on the contents of that image.
- Sound – The same as an image CAPTCHA, but with a sound clip instead.
- Text – Text is displayed to the user and the user must write it out in a textbox.
- Question – The user is asked a question such as “what colour is the sky?” and they must respond with an appropriate answer.
- Completion – The user must fill in a blank to complete a sentence, for example “The water was ___ and therefore scalded his hand.”
I’m going to address each type individually, and explain the pros and cons.
Image
Image-based CAPTCHA systems are perhaps the most widely used, as (when implemented properly) they make it very difficult for automated processes to break them. The most common type of image CAPTCHA involves displaying a set of characters and asking the user to type them into a text box. As a spam bot cannot “read” an image like it can read text, it cannot solve the CAPTCHA. After a while, spam bot writers began using OCR (Optical Character Recognition) algorithms that could analyse the CAPTCHA image to deduce the word or string of characters it contained. To combat this, modern CAPTCHAs distort the image in a way that makes it extremely hard for OCR algorithms to understand whilst still allowing a human to read the image. Some methods of image distortion were quite easily defeated using noise reduction and other techniques, whilst others remain less susceptible to these techniques.
An OCR algorithm works by taking an image and splitting it into individual parts that each contain one character, then analysing each character (using one or more methods) in an attempt to identify it. Splitting the image into individual parts (known as segmentation) requires the algorithm to be able to identify what area each character resides in. Many methods of analysis involve comparing a database of known character images to the input image in order to calculate which is the most likely result. Some algorithms split the image into sections of light and dark (i.e. dark is where part of the letter is) and use this map to generate a series of vectors that describe the character. They then compare these vectors to their database in order to work out which letter is shown.
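To make the two stages concrete, here is a deliberately tiny sketch of segmentation followed by template matching. It is an illustration rather than a real OCR engine: the image is a hypothetical grid of "#" (dark) and "." (light) pixels, and the names `segment`, `distance`, `recognise`, `TEMPLATES` and `IMAGE` are all made up for this example.

```python
from typing import Dict, List

Glyph = List[str]  # rows of "#" (dark pixel) and "." (light pixel)


def segment(image: Glyph) -> List[Glyph]:
    """Cut the image into glyphs at every run of completely blank columns."""
    width = len(image[0])
    has_ink = [any(row[x] == "#" for row in image) for x in range(width)]
    glyphs: List[Glyph] = []
    start = None
    # The trailing False is a sentinel that closes a glyph touching the right edge.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            glyphs.append([row[start:x] for row in image])
            start = None
    return glyphs


def distance(a: Glyph, b: Glyph) -> int:
    """Number of differing pixels; differently sized glyphs never match."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 10**9
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognise(image: Glyph, templates: Dict[str, Glyph]) -> str:
    """Label each segmented glyph with its nearest template."""
    return "".join(
        min(templates, key=lambda ch: distance(glyph, templates[ch]))
        for glyph in segment(image)
    )


# A hypothetical "database" of known 3x3 characters.
TEMPLATES = {
    "L": ["#..", "#..", "###"],
    "T": ["###", ".#.", ".#."],
    "O": ["###", "#.#", "###"],
}

# "T" and "L" separated by one blank column.
IMAGE = [
    "###.#..",
    ".#..#..",
    ".#..###",
]

print(recognise(IMAGE, TEMPLATES))  # TL
```

Note how fragile the segmentation step is: a single dark pixel in the blank column joins the two characters into one wide glyph that matches nothing in the database.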
There are several ways to make recognition difficult for OCR algorithms:
- Background noise – Adding random dots, lines, shapes and characters in the background can make it difficult for OCR algorithms to differentiate between the text itself and the image background.
- Text colour – Giving each letter of the text a different colour makes it difficult to identify characters.
- Character font, size and rotation – Altering the font, size and rotation of each letter makes it difficult for OCR algorithms to correctly identify the character.
- Wave distortion – Twisting and convoluting each character or the string as a whole causes the position and shape of the text to be changed.
- Pinch/Punch distortion – Stretching the image in certain areas produces an effect similar to wave distortion.
- Foreground noise – Adding random dots, lines and shapes in the foreground puts small defects in the characters, making it hard for OCR algorithms to tell what the character says.
- Anti-segmentation – Pressing characters together or drawing a thick line at a random angle through the text makes it difficult for the OCR algorithm to split the image into individual characters.
Some of these are somewhat redundant – Gaussian noise can be filtered out, and different text colours can be removed by converting the image to greyscale and altering the contrast. Anti-segmentation works well, but can often make it difficult for humans to read the image too. The same applies to distortion and foreground noise – too little and it is pointless, too much and it makes it hard to read for humans.
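As a rough illustration of why coloured text buys so little, the sketch below converts RGB pixels to greyscale and then applies a hard threshold (the crudest possible contrast adjustment). The luma weights are the standard ITU-R BT.601 ones; the threshold of 128, the sample colours and the function names are assumptions chosen for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def to_grey(pixel: Pixel) -> int:
    """Convert an RGB pixel to a single brightness value (BT.601 weights)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarise(image: List[List[Pixel]], threshold: int = 128) -> List[List[int]]:
    """Return 1 for dark (ink) pixels and 0 for light (background) pixels."""
    return [[1 if to_grey(p) < threshold else 0 for p in row] for row in image]


RED = (200, 0, 0)            # one letter drawn in red
BLUE = (0, 0, 200)           # another letter drawn in blue
WHITE = (255, 255, 255)      # background
FAINT_GREY = (220, 220, 220) # light background noise

image = [
    [RED, WHITE, BLUE],
    [FAINT_GREY, WHITE, FAINT_GREY],
]

print(binarise(image))  # [[1, 0, 1], [0, 0, 0]]
```

After thresholding, the red and blue letters are indistinguishable from each other and the faint noise has vanished, so the colours did nothing to slow the bot down.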
One of the biggest mistakes in creating CAPTCHAs is using real words. I know I’m going to take a lot of heat for saying that, as reCAPTCHA and many other popular systems use real words, but it’s a case of giving the spam bot more information than you need to. Consider how T9 predictive text works on phones – as you type letters it compares your input so far to a list of known words and uses it to deduce what you were trying to type. If a bot can deduce that your CAPTCHA has 7 letters and that letters 1, 2, 5 and 6 are known, giving “ob__qu_”, it then only has to decide between “oblique”, “obloquy” and “obsequy”. If the bot can calculate whether the third letter is more likely “s” or “l”, and the last letter is more likely “e” or “y”, it has solved the CAPTCHA. Even if it cannot definitely (or even probabilistically) say which word it is, it still has a 33% chance of choosing the right answer if it randomly selects a possible word. If you use random strings such as “K2Tv6p” the bot cannot use dictionary prediction to solve your CAPTCHA.
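The dictionary attack described above takes only a few lines to sketch. The word list here is a tiny hypothetical stand-in for a real dictionary, and `candidates` and `narrow` are names invented for this example; positions in `narrow` are zero-based.

```python
import re
from typing import Dict, List


def candidates(pattern: str, words: List[str]) -> List[str]:
    """Return the words that fit a pattern in which "_" marks an unread letter."""
    regex = re.compile("".join("." if c == "_" else re.escape(c) for c in pattern))
    return [w for w in words if regex.fullmatch(w)]


def narrow(words: List[str], guesses: Dict[int, str]) -> List[str]:
    """Keep only the words that agree with the bot's best guess at given positions."""
    return [w for w in words if all(w[i] == ch for i, ch in guesses.items())]


# Hypothetical dictionary; a real bot would load a full word list.
WORDS = ["oblique", "obloquy", "obsequy", "obscure", "opaque", "baroque"]

shortlist = candidates("ob__qu_", WORDS)
print(shortlist)                            # ['oblique', 'obloquy', 'obsequy']
print(narrow(shortlist, {2: "l", 6: "e"}))  # ['oblique']
```

With a random string there is no word list to filter against, so every unread character has to be recognised on its own.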
Warping your text using a wave or pinch/punch method with random parameters makes it hard for bots to compare the characters against a database, as the warp alters not only the position and rotation of the character but also causes straight lines to become curved.
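Here is a minimal sketch of a wave warp, assuming the same kind of "#"/"." pixel grid a CAPTCHA script might hold before rendering. Each column is shifted vertically by a sine of its x position; `wave_warp` and its parameters are hypothetical names, and a real implementation would pick the amplitude, wavelength and phase at random for every challenge.

```python
import math
from typing import List


def wave_warp(image: List[str], amplitude: int, wavelength: float, phase: float) -> List[str]:
    """Shift every column up or down along a sine wave.

    The output is 2 * amplitude rows taller than the input so that no
    pixel is pushed outside the canvas.
    """
    height, width = len(image), len(image[0])
    out = [["."] * width for _ in range(height + 2 * amplitude)]
    for x in range(width):
        dy = round(amplitude * math.sin(2 * math.pi * x / wavelength + phase))
        for y in range(height):
            out[y + amplitude + dy][x] = image[y][x]
    return ["".join(row) for row in out]


# A perfectly straight horizontal stroke, eight pixels wide.
stroke = ["########"]

for row in wave_warp(stroke, amplitude=2, wavelength=8, phase=0.0):
    print(row)
```

The straight stroke comes out as a curve spread over five rows, which is exactly what defeats a pixel-for-pixel comparison against a stored template.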
Some CAPTCHA systems display an image that contains characters of two or more different colours and ask the user to write down only the letters that are of a certain colour. For example, if the image displayed “QPT7XB5H6R” and asked the user to write in the white letters, the response would be “T7BHR”. The problem with this is that you are telling the spam bot what colour it should search for to find the right letters, which makes segmentation much easier.
Another type of image-based CAPTCHA involves showing the user a picture of an object (e.g. a cat, hammer, bowl, etc.) and asking them to respond with the name of the object. Whilst this system is easy to create, its downfall is that a bot writer can very easily request the CAPTCHA image repeatedly until they have a copy of each image, labelled with its answer. When a challenge is issued, the image file’s hash can be compared against the hashes of the collected images and the CAPTCHA is solved. If the CAPTCHA inserts subtle noise or modifies the image slightly so that the file hash changes, the bot can simply split the image up into blocks and calculate the average R, G and B levels for each. The image in the database that most closely matches the challenge image is chosen.
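The block-averaging trick can be sketched as follows. Images are plain lists of rows of (R, G, B) tuples, the 2x2 block grid and squared-difference score are assumptions made for this example, and the two "photos" are crude hypothetical stand-ins for a harvested image database.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def block_signature(image: Image, blocks: int = 2) -> List[Tuple[float, ...]]:
    """Average the R, G and B levels inside each cell of a blocks x blocks grid."""
    height, width = len(image), len(image[0])
    signature = []
    for by in range(blocks):
        for bx in range(blocks):
            ys = range(by * height // blocks, (by + 1) * height // blocks)
            xs = range(bx * width // blocks, (bx + 1) * width // blocks)
            pixels = [image[y][x] for y in ys for x in xs]
            signature.append(tuple(sum(p[c] for p in pixels) / len(pixels) for c in range(3)))
    return signature


def signature_distance(a, b) -> float:
    """Sum of squared differences between two signatures."""
    return sum((ca - cb) ** 2 for pa, pb in zip(a, b) for ca, cb in zip(pa, pb))


def identify(challenge: Image, known: Dict[str, Image]) -> str:
    """Return the label of the known image whose signature is closest."""
    sig = block_signature(challenge)
    return min(known, key=lambda name: signature_distance(sig, block_signature(known[name])))


# Hypothetical harvested images, 4x4 pixels each.
cat = [[(30, 30, 30) if x < 2 else (200, 120, 40) for x in range(4)] for y in range(4)]
bowl = [[(40, 80, 200) if y < 2 else (250, 250, 250) for x in range(4)] for y in range(4)]

# The same cat picture with slight noise, so its file hash would differ.
noisy_cat = [
    [tuple(c + ((x + y) % 3 - 1) * 4 for c in cat[y][x]) for x in range(4)]
    for y in range(4)
]

print(identify(noisy_cat, {"cat": cat, "bowl": bowl}))  # cat
```

Because the averages barely move when a little noise is added, the noisy copy still lands far closer to the original than to any other image.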
Sound
When designing image-based CAPTCHA systems, accessibility must be considered – people who are blind or have reading difficulties are unlikely to be able to solve the CAPTCHA properly. Most websites that use a CAPTCHA give a link to a page where the administrator can be contacted in order to disable CAPTCHAs for a specific account or IP address. Providing an audio version of the CAPTCHA is not advised, since manipulating audio in PHP or ASP is difficult, whereas it isn’t difficult for a bot to capture the audio and analyse it to deduce the contents of the CAPTCHA. In general, sound CAPTCHAs are not very secure.
Text
Some CAPTCHA systems must be installed on servers that do not have image manipulation libraries such as GD available, and so must resort to other methods of displaying a CAPTCHA. Text-based CAPTCHA systems display a word or phrase and the user must type it out. Unfortunately this has the downside that the text (or at least a derivative of it) must be stored in the page’s code. To make it more difficult for bots to break, the CAPTCHA system can write the letters in separate elements and shuffle their positions round using CSS and/or JavaScript. This means that “<span>bought</span> <span>you</span> <span>cat</span> <span>a</span>” in the code could be re-organised to read “you bought a cat” when the browser displays it. By studying your script, however, the bot writer could easily reverse engineer your code to allow the bot to decode the solution.
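To show how thin this protection is, the sketch below scrambles a sentence using one possible mechanism, the CSS `order` property on flex items, and then undoes it the way a bot writer might after reading the page source. The choice of `order`, the exact markup and the function names are assumptions made for this example, and the words are assumed to contain no HTML special characters.

```python
import random
import re


def scramble(sentence: str, rng: random.Random) -> str:
    """Emit the words in shuffled source order, with CSS restoring the display order."""
    indexed = list(enumerate(sentence.split()))
    rng.shuffle(indexed)
    spans = "".join(f'<span style="order:{i}">{word}</span>' for i, word in indexed)
    return f'<div style="display:flex;gap:0.3em">{spans}</div>'


def unscramble(html: str) -> str:
    """What a bot does: read the order values and sort the words by them."""
    pairs = re.findall(r'<span style="order:(\d+)">([^<]*)</span>', html)
    return " ".join(word for _, word in sorted(pairs, key=lambda p: int(p[0])))


markup = scramble("you bought a cat", random.Random())
print(markup)
print(unscramble(markup))
```

Whatever rule the browser uses to put the words back in order is, by necessity, shipped to the bot as well.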
Other CAPTCHA systems display the word in ASCII art to make it difficult for bots to understand. Mixed styles and random alterations can make it quite difficult for a bot to deduce what the text says, but by looking for the patterns between spaces and characters a map can be generated in order to effectively guess the solution.
Question
Question-based CAPTCHAs may use text or an image to pose a question to the user which they must answer correctly to complete the CAPTCHA. The problem with questions is not only that they must be incredibly easy so anyone can solve them, but that many questions must be entered in order to make it hard for the bot to work out which one is being asked. In the case of text questions, natural language processing can usually deduce the answer, but in most cases one may simply make a template of every question (e.g. “What is 34 + 12?” becomes “What is x + y?”) and match the template in order to solve the problem. Other questions are ambiguous, for example most people would respond to “what colour is the sky?” with “blue”, but it is also black at night, orange at sunrise, white when it’s cloudy and a whole bunch of colours if you happen to live near the northern lights. This also requires people to know a good amount of English.
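Template matching of this kind can be sketched with a handful of regular expressions. The three templates below are hypothetical; a real bot would simply keep adding one per question style it encounters.

```python
import re
from typing import Optional

# Each entry pairs a question template with a function that computes the answer.
TEMPLATES = [
    (re.compile(r"what is (\d+) \+ (\d+)\??", re.IGNORECASE), lambda a, b: int(a) + int(b)),
    (re.compile(r"what is (\d+) plus (\d+)\??", re.IGNORECASE), lambda a, b: int(a) + int(b)),
    (re.compile(r"what is (\d+) minus (\d+)\??", re.IGNORECASE), lambda a, b: int(a) - int(b)),
]


def solve(question: str) -> Optional[str]:
    """Return the answer if the question fits a known template, else None."""
    for pattern, answer in TEMPLATES:
        match = pattern.fullmatch(question.strip())
        if match:
            return str(answer(*match.groups()))
    return None


print(solve("What is 34 + 12?"))         # 46
print(solve("what is 5 plus 66"))        # 71
print(solve("what colour is the sky?"))  # None
```

Randomising the numbers adds nothing here, because the template captures them whatever they are; only a large pool of differently worded questions forces the bot writer to do more work.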
Questions written within an image need to be obscured too, otherwise bots can simply use OCR to extract the question. The problem is that questions are written in a language and therefore dictionary prediction is possible. Obscuring a whole sentence may also make it difficult to read for the user.
Completion
Completion CAPTCHAs are interesting, because they have a very fluid structure – i.e. there are billions of possible challenges that the script can issue. Natural language processing of books and online texts may allow hundreds of thousands of appropriate sentences to be extracted and imported into a database. They usually require the user to be quite literate, but not so much that it prohibits users with limited English from solving the CAPTCHA. The limitations are still similar to question-based CAPTCHAs, but since a sentence with a blank isn’t actually asking for a logical response (whereas “what is 5 plus 66” is) it makes it harder for bots to deduce an answer.
In a nutshell
It is best to use an image-based CAPTCHA with enough distortion and noise to make analysis difficult without making it too hard to read. Adding a line that is about the width of the lines in the characters through the text or squashing the letters together helps prevent bots from performing segmentation properly. Multiple fonts make a CAPTCHA more resilient, whilst multiple colours often do not. Using words instead of random strings is a bad idea. It is a good idea to randomize the rotation and vertical positions of characters, as well as the character spacing.
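As a starting point, the sketch below generates a random challenge string and a random rotation, vertical offset and spacing for each character, ready to hand to whatever drawing code you use. The alphabet (which leaves out easily confused characters such as 0/O and 1/l/I), the numeric ranges and all of the names are assumptions made for this example, not recommended values.

```python
import secrets
from dataclasses import dataclass
from typing import List

# Assumed alphabet: letters and digits minus the look-alikes 0, O, o, 1, l, I, i.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"


@dataclass
class Placement:
    char: str
    rotation_deg: float  # clockwise rotation applied to the character
    y_offset_px: int     # vertical shift from the baseline
    gap_px: int          # space before the next character; negative values overlap


def random_challenge(length: int = 6) -> str:
    """Build a random string that no dictionary can predict."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def layout(text: str, max_rotation: float = 25.0, max_offset: int = 6,
           min_gap: int = -2, max_gap: int = 4) -> List[Placement]:
    """Choose independent random rendering parameters for every character."""
    rng = secrets.SystemRandom()
    return [
        Placement(
            char=c,
            rotation_deg=rng.uniform(-max_rotation, max_rotation),
            y_offset_px=rng.randint(-max_offset, max_offset),
            gap_px=rng.randint(min_gap, max_gap),
        )
        for c in text
    ]


challenge = random_challenge()
for placement in layout(challenge):
    print(placement)
```

Allowing a slightly negative gap lets neighbouring characters touch, which is one cheap way to get the squashed-together effect that hinders segmentation.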
If you’ve not got image manipulation libraries available, use ASCII art to display a random string instead. Where possible, randomize the characters you use to draw the art, and insert a few stray characters in random places to throw off the bots. You can also jumble the ASCII around with CSS and JavaScript.
And that’s about it – go forth and create good CAPTCHAs!
Posted: September 23rd, 2009
at 5:54pm by admin
Categories: Development, Tutorials


