In our article on image recognition, we looked at how machine learning and computer vision experts use convolutional neural networks (CNNs) to teach computers how to recognize and categorize images. One thing that should be abundantly clear from that article is that designing, building, and training a convolutional neural network from scratch is a time-consuming and complex job.

Fortunately, a number of tech giants and even a few upstarts have developed APIs that allow anyone to plug into their image recognition technology without having to build and train their own CNN. We decided to put some of these APIs to the test using real photographs from our team.

The Contenders

First, we’ll take a look at the APIs in our test. While there are a bunch of image recognition services that have popped up in the last few years, in this article we’ll focus on several mature options from major players in AI and computer vision as well as a couple of high-profile startup entrants.

  • Google Cloud Vision is Google’s visual recognition API, based on the open-source TensorFlow framework and using a REST API. It detects individual objects and faces and contains a pretty comprehensive set of labels. It also comes with a few bells and whistles, including Optical Character Recognition (OCR) and integration with Google Image Search to find related entities and similar images from the web.
  • IBM Watson Visual Recognition is part of IBM’s Watson Developer Cloud. This tool comes with a large set of built-in classes, but is really built for training custom classes based on images you supply. Like Google Cloud Vision, it also supports a number of nifty features, including OCR and NSFW detection.
  • Amazon Rekognition is Amazon’s image recognition API, which (unsurprisingly) integrates with other AWS services.
  • Microsoft Computer Vision API has many of the same features as Google Cloud Vision and IBM Watson, including celebrity detection and OCR. An interesting piece of this API is that in addition to listing tags along with confidence predictions it also attempts to generate a natural-language description based on those tags.
  • is an upstart image recognition service that also uses a REST API. One interesting aspect is that it comes with a number of modules that help tailor its algorithm to particular subjects, like weddings, travel, and food.
  • CloudSight seems to take a slightly different approach to image recognition. Their website offers significantly fewer details than some of their competitors’, but their service seems to offer some combination of algorithmic and manual (meaning human) tagging.
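
To give a sense of what plugging into one of these REST APIs involves, here is a minimal Python sketch of a label request for Google Cloud Vision. The images:annotate endpoint, the LABEL_DETECTION feature type, and the labelAnnotations response field come from Google's public REST reference; the helper names build_label_request and top_labels are our own inventions, and authentication (an API key or OAuth token) is deliberately left out. The other services work in a broadly similar way, though their field names differ.

```python
import base64

# Public REST endpoint for Cloud Vision's annotate method.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_bytes: bytes, max_results: int = 5) -> dict:
    """Build the JSON body for a single-image label detection request."""
    return {
        "requests": [
            {
                # Image content travels as a base64-encoded string.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def top_labels(response: dict) -> list:
    """Pull (description, score) pairs out of an annotate response."""
    annotations = response["responses"][0].get("labelAnnotations", [])
    return [(a["description"], a["score"]) for a in annotations]
```

The body returned by build_label_request would be sent as JSON in a POST to VISION_ENDPOINT with whatever HTTP client you prefer, and the parsed JSON reply handed to top_labels.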

For our test, we used several photographs taken with regular camera phones. We chose them to highlight a few different areas: how well the APIs recognize and distinguish between animals, people, text, and objects. We haven’t edited the photos in any way, though we have stripped out the metadata to make sure the APIs are considering only the images.
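
For readers who want to strip metadata from their own test photos, here is a minimal sketch of one way to do it in pure Python. It is not necessarily the method used for this article. It walks the segments of a JPEG file and drops the APP1 segments, which is where EXIF (and XMP) data normally lives. It assumes a well-formed file in which every segment before the start-of-scan marker carries a length field, and it leaves other possible metadata carriers (such as comment segments) alone. The function name strip_app1_segments is our own.

```python
import struct


def strip_app1_segments(jpeg: bytes) -> bytes:
    """Return a copy of a JPEG with its APP1 (EXIF/XMP) segments removed."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    out = bytearray(jpeg[:2])
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("malformed JPEG: expected a marker")
        marker = jpeg[pos + 1]
        if marker == 0xDA:
            # Start of scan: the rest is compressed image data, copied as-is.
            out += jpeg[pos:]
            return bytes(out)
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2 : pos + 4])
        end = pos + 2 + length
        if marker != 0xE1:  # keep every segment except APP1
            out += jpeg[pos:end]
        pos = end
    raise ValueError("malformed JPEG: no start-of-scan marker found")
```

In practice you would read the photo with open(path, "rb"), pass the bytes through the function, and write the result to a new file before uploading it.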

First Test: A Parked Car

This is meant to be a pretty easy one. Our car is large, in the center of the image, and there isn’t much else around.

A white sedan parked along the curb

| Google CV | IBM Watson | Amazon R | Microsoft CV | | Cloudsight |
| Car | Limousine | Automobile | “a car parked on the side of a building” | Car | “White classic sedan” |
| Vehicle | Car | Car | Car | Vehicle | |
| Land vehicle | Vehicle | Vehicle | Outdoor | Road | |
| Luxury vehicle | Berlin (Limousine) | Limo | Ground | Street | |
| Full-size car | Beach Wagon | Cab | Parked | Transportation system | |

As you can see, every API we tested successfully identified the car, with Google Cloud Vision and Cloudsight appropriately characterizing it as a “luxury vehicle” and a “classic white sedan,” respectively. Both IBM’s Watson and Amazon’s Rekognition APIs mistook it for a limousine, while Microsoft Computer Vision and both picked up on some surrounding context as well.

Additionally, we took a look at Google’s suggested web entities, which, unsurprisingly, all had to do with cars (car, compact car, 2005 Ford Thunderbird, luxury vehicle, full-size car, sedan, and performance car), though it’s worth pointing out that the car in question appears to be a late-80s model Mercury Grand Marquis.

Second Test: Two Cats in Bed

This one is meant to be slightly trickier. There are two subjects, neither of which is in the center of the frame. It’s also shot from a top-down perspective, which might be harder to interpret than a straight-on view.

A black cat and a white cat in bed

| Google CV | Watson VR | Amazon R | Microsoft CV | | Cloudsight |
| Cat | Coal black color | Animal | “a cat laying on a bed with a stuffed animal” | Bed | “two black and white long coated cat” |
| Mammal | Animal | Cat | Indoor | Cat | |
| Small- to medium-sized cats | Feline | Pet | | Bedroom | |
| Cat-like mammal | Carnivore | Siamese | | Sleep | |
| Nap | Mammal | Mammal | | Portrait | |

Once again, all of our APIs seemed to get the general drift of our image, though we also saw a little more variation in terms of tags suggested and overall quality. Google Cloud Vision, Cloudsight, and Microsoft Computer Vision all noted that there were two cats (though MCV mistook one of them for a stuffed animal). again picked up on more scenery and context than the other APIs.

Looking at Google’s web entities we found kitten, cat, and nap.

Third Test: The Grand Canyon

With this test we wanted to see how well the APIs handled natural scenes. There’s a lot going on in this picture, what with the canyon, the storm clouds, and the rocks and shrubs in the foreground.

A landscape of the Grand Canyon

| Google CV | Watson VR | Amazon R | Microsoft CV | | Cloudsight |
| Mountainous landforms | Nature | Canyon | “a view of a mountain” | Landscape | “brown and gray rock formation” |
| Nature | Valley | Valley | Mountain | No person | |
| Geographical feature | Canyon | Outdoors | Sky | Travel | |
| Wilderness | Reddish orange color | Mountain | Valley | Desert | |
| Landform | Indian red color | Ground | Outdoor | Canyon | |

This one presented a little more of a challenge for our APIs, judging from the vagueness of many of the tags. Of our six, only Amazon Rekognition correctly identified it as a canyon as its top guess, though, Microsoft CV, and Watson all had it in their top five. Google CV suggested it too, along with “national park,” though these were further down its list.

Fourth Test: Human Subjects

A number of our APIs boast facial recognition features, so we decided to see just how good they were.

| Google CV | Watson VR | Amazon R | Microsoft CV | | Cloudsight |
| Tourism | Coal black color | Human | “a man standing on a sidewalk” | People | “men’s black suit, women’s black blazer” |
| Street | Person | People | Outdoor | Street | |
| Art | Demonstrator | Person | Ground | City | |
| Pedestrian | Sidewalk | Clothing | Tree | Pavement | |
| Tours | Juggler | Overcoat | Person | Woman | |

This photo generated some interesting results. Five out of six APIs generally recognized that the scene had something to do with people on the street, though none could agree on what exactly they were doing. GCV got closest with tourism, though Watson seemed convinced they were either demonstrators or jugglers. Neither of the APIs that generate natural language descriptions picked up on all three people, even though Microsoft CV managed to pick up all three faces. Interestingly, Cloudsight ignored the people entirely but did register the clothes worn by two of them. As we’ve seen before, and Microsoft CV picked up more scenery than the others.

For this one, we also checked the facial recognition capabilities of five out of our six APIs (Cloudsight doesn’t currently support facial recognition).

| Google CV | Watson VR | Amazon R | Microsoft CV | | Cloudsight |
| Yes | Yes | Yes | No | Yes | n/a |
| Expressions | Gender, age | Gender, age, expressions | | | |

The results here were pretty good. Only Microsoft CV missed the faces entirely. Google CV and Amazon Rekognition agreed that all three people looked happy or joyful, and both Rekognition and Watson correctly identified the genders and agreed on the relative ages of the three people. It should be said, though, that their estimates on our subjects’ ages varied considerably: Watson pegged our man on the left at 55-64 while Rekognition thought he could be anywhere from 60-90.
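
To see how far apart those two age estimates really are, it helps to work out where the ranges overlap. The short sketch below does that arithmetic; range_overlap is our own helper name, and the two ranges are the ones quoted above.

```python
def range_overlap(a, b):
    """Return the (low, high) overlap of two inclusive ranges, or None."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


watson = (55, 64)       # Watson's estimate for the man on the left
rekognition = (60, 90)  # Rekognition's estimate for the same man

print(range_overlap(watson, rekognition))  # (60, 64)
```

The two services agree only on a five-year window, 60 to 64, and Rekognition's range is more than three times as wide as Watson's.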

Fifth Test: Art vs. Life

For this test, we decided to see if our APIs could tell the difference between a painting and a photograph. To give the APIs a sporting chance, we chose a pretty realistic painting and made sure the picture frame was at least partly visible.

| Google CV | Watson VR | Amazon R | Microsoft CV | | Cloudsight |
| Painting | Indian red color | People | “Person riding a surfboard on top of a book” | People | “cowboy with horse painting” |
| Poster | Camel racing | Person | Book | Painting | |
| Modern art | Racing | Human | Picture frame | Art | |
| Art | Sport | Angus | | Cavalry | |
| Mythology | Camel | Animal | | Illustration | |

This image produced more disagreement among our APIs than any of the others. The only consistent point of agreement was that it featured a person, though guesses as to his activity ranged from cowboy (Cloudsight) to cavalryman ( to camel racer (Watson) to a person surfing atop a book (Microsoft CV). There was also significant disagreement over what animal he was with, and only Cloudsight correctly identified it as a horse. That said, four out of our six APIs realized that it was a painting. Amazingly, Google CV’s web entities managed to guess the artist, museum, and even type of paint (oil) used in its production.

When it comes to the person himself, three out of six APIs recognized a face. Watson could tell that a face was present, though it couldn’t say with certainty what the figure’s age was or even whether it was a man or a woman. Amazon Rekognition and Microsoft CV did much better: Both correctly identified the figure as a roughly middle-aged male, and Rekognition even identified his expression as unsmiling and possibly sad.

Sixth Test: Wine to Finish

Our final test is about recognizing inanimate objects and testing the OCR capabilities of our APIs. We have here a partly obscured bottle of wine with a label written in multiple fonts, along with a glass of wine and a partly visible woman in the back.

| Google CV | Watson VR | Amazon R | Microsoft CV | | Cloudsight |
| Alcoholic beverage | Bottle green color | Alcohol | “A bottle of beer sits next to a glass of wine” | Wine | “La Posta Pizzella bottle” |
| Drink | Beverage | Beer | Table | Drink | |
| Beer | Food | Beer bottle | Bottle | Alcohol | |
| Beer bottle | Brew | Beverage | Wine | Bottle | |
| Liquor | Beer | Bottle | Indoor | No person | |

In this final test, all our APIs realized the scene involved alcohol of some kind, though only unambiguously recognized our bottle as a wine bottle, while four of our other APIs incorrectly tagged the scene with beer or liquor. As usual, Microsoft CV and both used more context tags than the other services. It is worth noting that only Cloudsight seems to have identified the bottle based on the label, though since Cloudsight doesn’t claim to use OCR, that seems likely to be a human-applied tag.

The real test in this final round, though, was in how the OCR capabilities of these different APIs handled the various styles and sizes of text.

| Google CV | Watson VR | Amazon R | Microsoft CV | | Cloudsight |
| Yes | Yes | n/a | Yes | n/a | n/a |
| Somewhat readable | Unreadable | | Highly readable | | |

Here the results were extremely variable. Of the three APIs that employ OCR, only Microsoft CV was able to read most of the essential text on the label without making any mistakes. Google CV managed to extract some complete words and some fragments, but the result would be hard to decipher without knowing what the label actually said. Watson VR struggled the most, only extracting three words, two of which were fragments of the same word.
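
The readability ratings above are informal judgments. One rough way to put a number on a comparison like this is word-level recall: the share of words on the label that show up, whole, in the OCR output. The sketch below is only an illustration of that idea. The function names are our own, and the sample strings are invented for the example rather than taken from any API's actual output (only the name “La Posta Pizzella” comes from the results above).

```python
import re


def words(text: str) -> list:
    """Lowercase a string and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recall(expected: str, extracted: str) -> float:
    """Fraction of the expected words that appear whole in the extracted text."""
    want = words(expected)
    if not want:
        return 1.0
    got = set(words(extracted))
    return sum(w in got for w in want) / len(want)


label_text = "La Posta Pizzella"          # what the label says
clean_read = "LA POSTA PIZZELLA"          # invented: a perfect read
fragmented_read = "la posta pizz ella"    # invented: one word split in two

print(word_recall(label_text, clean_read))       # 1.0
print(word_recall(label_text, fragmented_read))  # 2 of 3 words recovered
```

A split word counts as a miss here, which matches how fragments read to a person: the letters are there, but the word is not.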

Where the Technology Is Today

These tests are, admittedly, pretty unscientific. They aren’t meant to be exhaustive comparisons, but rather a general overview of the APIs’ different capabilities and a demonstration of the current state of the art. We saw both where image recognition APIs have made great strides, and where they currently fall short.

It’s worth pointing out that in general, every API was able to extract at least partially correct information from every scene. That said, the quality and relevance of the tags were pretty variable. Many tags seemed redundant or vague. (Why apply the tag “vehicle” to an image that’s also tagged “car”?) Others were impressively specific or added potentially relevant context, though the more specific individual tags became (“camel racing,” “beach wagon”), the less accurate they tended to become as well.
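
One way to make the generic-versus-specific pattern concrete is to count how many APIs returned each tag. The sketch below does this for the parked car photo, using the tags as transcribed from the first test for the four services that returned tag lists under a known name. The tag_consensus helper and the cutoff of three APIs are our own choices for illustration, not part of any service.

```python
from collections import Counter

# Tags from the first test (the parked car), keyed by service.
car_test_tags = {
    "Google CV": ["Car", "Vehicle", "Land vehicle", "Luxury vehicle", "Full-size car"],
    "IBM Watson": ["Limousine", "Car", "Vehicle", "Berlin (Limousine)", "Beach Wagon"],
    "Amazon Rekognition": ["Automobile", "Car", "Vehicle", "Limo", "Cab"],
    "Microsoft CV": ["Car", "Outdoor", "Ground", "Parked"],
}


def tag_consensus(tags_by_api: dict) -> Counter:
    """Count, case-insensitively, how many APIs returned each tag."""
    counts = Counter()
    for tags in tags_by_api.values():
        # A set ensures each API contributes at most one vote per tag.
        counts.update({tag.lower() for tag in tags})
    return counts


consensus = tag_consensus(car_test_tags)
# Tags at least three APIs agree on, most popular first.
shared = sorted(
    ((tag, n) for tag, n in consensus.items() if n >= 3),
    key=lambda pair: (-pair[1], pair[0]),
)
print(shared)  # [('car', 4), ('vehicle', 3)]
```

The only tags with broad agreement are the two most generic ones. Every more specific guess, right or wrong, came from a single service.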

One area where the APIs tended to do very well was in facial recognition, which bodes well for its social media applications. (It’s worth noting here that the biggest social media company, Facebook, also has its own image recognition API, though it’s not publicly available for testing.) While OCR algorithms may be very good at reading printed matter on a flat surface, we’ve seen that they’re much spottier when it comes to reading text “in the wild.”

Ready to Get Started?

Impressed by what you saw? Interested in seeing how you can incorporate image recognition into your project? The good news is that these APIs do the computationally difficult and complex parts, so you may not need specific machine learning or data science expertise to get started.

That said, you will need developers who are comfortable implementing APIs, and if your project requires a very specific kind of image-recognition expertise (say, being able to distinguish between different makes and models of cars, or reading real-life signage), you may need a data scientist to help you train and perhaps build a more specifically tailored model.