Haar, LBP and HOG - Experiments in OpenCV Object Detection
I've spent some time lately coming up to speed and playing with OpenCV - especially the object detection routines. OpenCV is an open source computer vision library, currently in version 3.1. As far as I can tell it is the most widely used such library. I think, too, that the library is unusual in its comprehensive scope. OpenCV is a catch-all of vision-related utilities. Its functionality ranges from the most pedestrian routines for opening and manipulating a graphic file to elaborate implementations of full computer vision algorithms.
My original interest in OpenCV was at the more basic end. I want ways to be able to access graphic files and feeds from a camera, and to do some preprocessing before handing the image off to a neural network for recognition. OpenCV is rich with options for identifying shapes and colors within images, finding edges of objects, tracking motion and more.
As I dug deeper into OpenCV I realized, somewhat to my surprise, how mature and capable some of the library's object detection algorithms are. Three that caught my eye for further investigation were Haar Cascades, Local Binary Patterns (LBP), and Histogram of Oriented Gradients (HOG). These capabilities are often presented in terms of sample applications. For example, the library includes a fully trained human face detector that lets you build face detection into a project with just a few lines of code. However, the library also includes the necessary functionality to develop customized object recognizers that use these very same algorithms.
In addition to providing detectors with encouraging accuracy, these algorithms produce detectors that can be very efficient in terms of speed and other computing resources. The detectors are efficient enough to add basic vision to projects built around the current generation of single board computers - namely my Raspberry Pi 2 (soon to be upgraded to the recently announced Raspberry Pi 3).
Haar, LBP and HOG have a lot of similarity at the macro level. Fundamentally the algorithms are each concerned with extracting a mathematical model of the object image that teases out identifiable features such as shapes or textures. These abstractions are often referred to as feature descriptors or visual descriptors.
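To make the idea of a feature descriptor concrete, here is a minimal sketch of the basic textbook Local Binary Pattern in plain Python. It is only an illustration: the function names are hypothetical, images are assumed to be 2D lists of grayscale values, and the LBP features used by OpenCV's cascade training are, as far as I know, a more elaborate block-based variant of the same idea.

```python
def lbp_code(patch):
    """Return the 8-bit Local Binary Pattern code for a 3x3 grayscale patch."""
    center = patch[1][1]
    # Neighbours are visited clockwise, starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for row, col in offsets:
        # A neighbour contributes a 1 bit when it is at least as bright as the centre.
        code = (code << 1) | (1 if patch[row][col] >= center else 0)
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over every interior pixel of a 2D list image."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

The histogram, rather than the raw pixels, is what describes the texture of a region: two patches with different brightness but the same local pattern of lighter and darker neighbours produce the same codes.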
To build an object detector out of feature descriptors, hundreds (or better thousands) of images are collected, the feature descriptors are extracted, and then a detector is "trained" against an even larger library of negative images. Once a working, trained detector is available, usage is generally quite simple. A program passes an image and the trained detector to an OpenCV object detection function and, if matching objects are detected in the image, the function returns coordinates for one or more rectangular regions of interest (ROI). These ROIs are often presented in software by a familiar, colored bounding box around the recognized object.
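The usage pattern described above might look something like the following sketch for a Haar or LBP cascade (HOG detectors use a different interface). The file paths are placeholders and the `scaleFactor` and `minNeighbors` values are only common starting points, not settings taken from these experiments.

```python
def boxes_to_corners(rects):
    """Convert (x, y, w, h) rectangles into ((x1, y1), (x2, y2)) corner pairs."""
    return [((int(x), int(y)), (int(x + w), int(y + h))) for x, y, w, h in rects]


def detect_and_draw(image_path, cascade_path, output_path):
    """Run a trained cascade on one image and save a copy with bounding boxes."""
    import cv2  # imported here so the helper above works without OpenCV installed

    detector = cv2.CascadeClassifier(cascade_path)
    if detector.empty():
        raise IOError("could not load cascade: " + cascade_path)

    image = cv2.imread(image_path)
    if image is None:
        raise IOError("could not read image: " + image_path)

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Each detection is an (x, y, width, height) rectangle in pixel coordinates.
    rects = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # Draw the familiar coloured box around every region of interest.
    for top_left, bottom_right in boxes_to_corners(rects):
        cv2.rectangle(image, top_left, bottom_right, (0, 255, 0), 2)
    cv2.imwrite(output_path, image)
    return len(rects)
```

A call such as `detect_and_draw("test.jpg", "cascade.xml", "out.jpg")` (hypothetical file names) would return the number of objects found and write the annotated image.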
My work to-date has been very iterative. I've been experimenting with three different sets of images, each with very different characteristics, and moving back and forth among projects as I develop insight.
My first experimental project was to build a detector for U.S. Interstate signs. The Interstate Highway system uses a highly standardized format for these highway markers. I had seen a successful detector project based on Stop signs that I wanted to replicate without necessarily duplicating. (Note that this website also includes some very helpful tutorial information for training Haar Cascades in OpenCV)
Road signs seem to sit somewhere in the Goldilocks zone of object detection challenges - not too hard but not too easy. They are rigid objects with well defined edges that are highly visible by design. At the same time, road signs are found outdoors in a variety of settings - mountains, woodlands, grasslands, deserts, urban landscapes and more. Furthermore, photographs might be taken in any season, at any time of day, and from any angle.
The project was made easier by my discovery of a website called AARoads that includes, among other things, a gallery of user submitted photographs of Interstate signs. There was no downloadable database, but it was trivial to select and save a batch of the images. In a reasonable time I had identified 270 positive samples and another 20 for testing. (A golden rule in machine learning is to conduct tests using images that were not included in the original training set.)
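Keeping the test images separate is easy to automate. The sketch below is one way to do it, assuming the images are simply a list of file paths; the function name, file names and seed are hypothetical.

```python
import random


def split_holdout(paths, test_count, seed=0):
    """Shuffle image paths and set aside test_count of them for testing only."""
    if not 0 <= test_count <= len(paths):
        raise ValueError("test_count must be between 0 and len(paths)")
    shuffled = sorted(paths)  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_count:], shuffled[:test_count]


# Example with the sign project's numbers: 290 images, 20 held back for testing.
all_images = ["sign_%03d.jpg" % i for i in range(290)]
train_images, test_images = split_holdout(all_images, 20)
```

Fixing the seed means the same images stay in the test set from one training run to the next, so results remain comparable.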
The second training set is of wild birds found in nature. I would someday like to be able to detect animals found in natural habitats, so this set is right in line with my longer term goals. It might be pointing out the obvious, but birds in natural habitats are a good deal more challenging than highway signs. Where highway signs are designed to be seen, birds are frequently camouflaged, designed by evolution to blend in with their surroundings. Their shapes are much less well defined with lots of variation among species. Even the same specimen will show a variety of shapes as it moves through the world - in computer vision lingo they are "deforming" objects.
Once again my task was made profoundly easier by the discovery of a pre-existing database of usable bird images, this time made available as a downloadable database specifically for computer vision experimentation by the California Institute of Technology (Caltech). Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an annotated dataset containing an impressive 11,788 images of birds from 200 different species. I wound up selecting 2,200 for positive samples with another 30 for testing.
The third dataset I've been working with is made up of toy plastic farm animals against a mocked-up outdoor setting on my work table. This dataset expands the experimentation into a proof of concept project to start actually integrating object detection with other AI/robotics capabilities on a Raspberry Pi. It includes four different toy animals in a variety of poses - 1,800 pictures in total.
Of course there was no convenient online database. After building a somewhat bizarre scaffolding structure to mount a webcam above my pastoral farm scene diorama, I programmed the Pi to take batches of photos. The Pi would take one picture every five seconds and in between I would reposition the subject animal or animals.
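A timed capture loop of this kind could be sketched as follows. This is not the original script: the function name and file naming scheme are hypothetical, and the camera and file-writing steps are passed in as functions so the loop itself does not depend on any particular camera library.

```python
import time


def capture_batch(grab_frame, save_frame, count, interval_s=5.0, sleep=time.sleep):
    """Grab `count` frames, pausing `interval_s` seconds between consecutive shots."""
    saved = []
    for index in range(count):
        frame = grab_frame()
        if frame is not None:  # skip a shot if the camera returned nothing
            name = "shot_%04d.jpg" % index
            save_frame(name, frame)
            saved.append(name)
        if index < count - 1:
            sleep(interval_s)  # time to reposition the subject
    return saved
```

With OpenCV and a webcam, one plausible wiring is `camera = cv2.VideoCapture(0)`, then `capture_batch(lambda: camera.read()[1], cv2.imwrite, 100)`, since `read()` returns a success flag and the frame, and `imwrite` takes a file name and an image.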
It is difficult, or perhaps even impossible, to evaluate the OpenCV implementations of the three algorithms and determine that one is better than the others. There are simply too many variables at play. When one algorithm appears to perform better than another on a task, there is no objective way to rule out some subtlety in the preparation of the training images, a nuance of the many individual parameters, or simply the ineptitude of the experimenter. (In fairness to myself I'll note that even in highly respected, milestone publications on machine vision experiments, researchers are often forced to confess that they are unsure which factors contributed to improved results and which factors hindered them.)
Nonetheless, with some hands-on experience under my belt, there are quite a few observations to be made.
With a little effort, Haar, LBP and HOG all work very well. I've been able to train detectors to recognize certain objects within photographs at rates exceeding 90%. Not all detectors have achieved such stellar results; some types of objects and scenarios seem to be better suited to detection by these methods than others.
Somewhat to my surprise, the detectors built around the toy animals dataset were the least successful and least gratifying overall. LBP provided the best results, but even so there were far too many false positives in which the detector reported an object where there was none. More importantly, there were many false negatives where the detector missed the object altogether. I suspect that a factor in the poor performance was the relative blandness of the dataset. I originally thought that a simplified environment would make for a good proof of concept experiment, but in hindsight it seems that the training algorithms might do better with greater variation and contrast in the data.
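False positives and false negatives can be counted automatically if each test image has hand-marked boxes to compare against. The sketch below is one common way to do that and is not necessarily how the results here were scored: a detection counts as correct when its overlap (intersection over union) with an unclaimed hand-marked box reaches a threshold, and the 0.5 default is an assumed convention.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0


def score_image(detections, truths, threshold=0.5):
    """Count true positives, false positives and false negatives for one image."""
    unmatched = list(truths)
    tp = 0
    for det in detections:
        match = next((t for t in unmatched if iou(det, t) >= threshold), None)
        if match is not None:
            unmatched.remove(match)  # each real object can be claimed only once
            tp += 1
    return tp, len(detections) - tp, len(unmatched)
```

Summing the three counts over a test set gives the detection rate (true positives divided by true positives plus false negatives) and shows at a glance whether a detector errs toward phantom objects or missed ones.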
The detectors are pretty sensitive to the orientation of the object. So for an asymmetrical object I found myself building several detectors in different orientations. For example, each of the plastic farm animals uses three detectors - left profile, right profile, and forward facing.
The wild bird project is an interesting case study in this regard. My first attempt used training images with a full bird body from any perspective. The results were terrible and the ROI bounding boxes didn't appear much different than random boxes drawn on the test images. For my second effort I selected only birds in a left facing profile and cropped the image down to the head and neck. The results were dramatically improved. Even though I'm still a good ways off from "perfect," I have the impression that a satisfying bird detector is possible.
One area in which the algorithms differ dramatically is in the time required to train the detector and the time required to run a detector once trained. The training time for a Haar cascade is striking. With only 270 positive samples, the Interstate sign project took about 14 hours to train to a modest level of precision on my newish Intel-powered quad-core desktop. For larger datasets trained to a higher precision, people routinely report Haar trainings that are measured in multiple weeks of round-the-clock operation. In contrast, on the same Interstate dataset, the LBP training took about 15 minutes, and HOG training was less than a minute. Although always faster than Haar on similar data, LBP training times will quickly become lengthy as well, with various experiments pushing up over 10 hours depending on data and training parameters.
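Assuming training is done with OpenCV's standard `opencv_traincascade` tool, switching between Haar and LBP comes down to a single `-featureType` option, which makes it easy to run both on the same data. The sketch below only builds the command line; the helper name, file names and sample counts are placeholders, not the settings used in these experiments.

```python
def traincascade_command(feature_type, vec_file, bg_file, out_dir,
                         num_pos, num_neg, num_stages=20, width=24, height=24):
    """Build an argument list for OpenCV's opencv_traincascade tool."""
    if feature_type not in ("HAAR", "LBP"):
        raise ValueError("feature_type must be 'HAAR' or 'LBP'")
    return [
        "opencv_traincascade",
        "-data", out_dir,            # directory where the trained stages are written
        "-vec", vec_file,            # packed positive samples
        "-bg", bg_file,              # text file listing the negative images
        "-numPos", str(num_pos),
        "-numNeg", str(num_neg),
        "-numStages", str(num_stages),
        "-featureType", feature_type,
        "-w", str(width),            # training window size in pixels
        "-h", str(height),
    ]


# Placeholder values for illustration only.
lbp_cmd = traincascade_command("LBP", "signs.vec", "negatives.txt", "out_lbp", 240, 600)
```

The list can be handed to `subprocess.run(lbp_cmd, check=True)`, and wrapping that call with `time.time()` readings gives a direct comparison of training times.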
The problem with the long training times on Haar and LBP is not that it takes a long time to create a detector, but that it makes it difficult to experiment with variations in the data and parameters. With Haar at the extreme end, it could take months or even years to optimize the training for a specific purpose. Of course, if I had access to more computing resources the issue would be negated. The expanding availability of inexpensive cloud-based resources appears to already be having an impact on people's ability to carry out this type of experimentation, but that is not a path I'm planning to take at the moment.
As a practical matter, the more critical factor is the time it takes to run a trained detector. In this speed race, LBP is the clear winner. On the same image, an LBP detector will run about 10-15% faster than Haar. The real difference is seen when running LBP against HOG, where LBP executes about 10 times faster - certainly enough of a speed variance to make the difference between a usable and unusable method on a resource constrained vision project.
Depending on the use case, for the Raspberry Pi the techniques run right at the edge of usable. In my own experiments, it was taking less than 2 seconds to run an LBP detector on a standard 640 by 480 picture from the camera. Obviously not nearly fast enough for real-time video, but certainly fast enough for a fun robotics project. Remember also, that this time requirement is for one detector. To detect multiple object types or multiple perspectives would take proportionately longer. There are steps you could take to speed the process along. A smaller picture could be used and there are OpenCV settings which adjust the scalable, sliding window used in the detection. Of course the best way to improve the speed of detection in projects would be to use a faster CPU.
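A back-of-the-envelope count shows why a smaller picture or a coarser scale step helps so much. The sketch below estimates how many window positions a multi-scale sliding-window scan would visit; it is only an approximation with assumed defaults (a 24 by 24 window, a one-pixel step), not a model of OpenCV's actual scanning strategy, where the scale step corresponds to the `scaleFactor` argument of `detectMultiScale`.

```python
def count_windows(img_w, img_h, win_w=24, win_h=24, scale_factor=1.1, step=1):
    """Roughly count the window positions a multi-scale sliding-window scan visits."""
    total = 0
    scale = 1.0
    while True:
        # Shrinking the image by `scale` is equivalent to growing the window.
        w, h = int(img_w / scale), int(img_h / scale)
        if w < win_w or h < win_h:
            break
        total += ((w - win_w) // step + 1) * ((h - win_h) // step + 1)
        scale *= scale_factor
    return total


full = count_windows(640, 480)
half = count_windows(320, 240)
coarse = count_windows(640, 480, scale_factor=1.3)
```

Comparing `full` with `half` and `coarse` gives a feel for the savings, at the cost of missing small objects in the first case and objects that fall between scales in the second.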
Although somewhat outside the realm of performance, one of my most significant take-aways from this group of experiments is that preparing images for computer vision experimentation at this level is an enormous and almost overwhelming enterprise. First, all the images for training, both positive and negative, need to be assembled either by taking the photographs yourself, acquiring an existing image database, or haphazardly searching the Internet. Once assembled, the images require preprocessing. The training routines here are looking for images of the same size and proportion, cropped or marked to the object to be detected.
OpenCV provides a few utilities to assist in the preprocessing of the training images, but by and large the library does not have tools to help manage and manipulate medium and large sets of images. I found myself cobbling together little scripts and utilities to automate some of the more tedious tasks, which did help speed things along. Nonetheless, manual preparation of the training data requires a significant investment of time. If you consider that experimentation with various algorithms and image subject matter would involve processing many thousands of images, perhaps multiple times each, then even if you set up a very efficient environment, processing times could easily grow to tens or hundreds of hours.
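As an example of the kind of small script that helps, the sketch below writes the two plain-text lists that OpenCV's cascade training tools read: a positives file with one line per image giving the path, the number of objects, and an x y width height box for each, and a negatives file with one path per line. The function names and file names are hypothetical, and this is not the tooling actually used for these experiments.

```python
def positives_line(image_path, boxes):
    """Format one positives line: path, object count, then x y w h for each box."""
    fields = [image_path, str(len(boxes))]
    for x, y, w, h in boxes:
        fields.extend(str(int(v)) for v in (x, y, w, h))
    return " ".join(fields)


def write_description_files(positives, negatives, pos_path, neg_path):
    """Write the positives and background (negatives) list files used for training."""
    with open(pos_path, "w") as pos_file:
        for image_path, boxes in positives:
            if boxes:  # an image with no marked object is not a positive
                pos_file.write(positives_line(image_path, boxes) + "\n")
    with open(neg_path, "w") as neg_file:
        for image_path in negatives:
            neg_file.write(image_path + "\n")
```

Because the format is space separated, image paths containing spaces should be avoided or renamed before the lists are generated.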
In conclusion, as I've mentioned several times in this write-up, I'm very impressed by the performance of these techniques. I'm particularly interested in LBP for its combined speed and accuracy. At this juncture I've actually returned to my original investigations and have been experimenting with neural network processing of similar data to that used here. However, I'm relatively certain I'll be returning to LBP soon as I work to get vision off the workbench and onto a functioning robot.
May 6, 2016
About the Author: Ralph Heymsfeld is the founder and principal of Sully Station Solutions. His interests include artificial intelligence, machine learning, robotics and embedded systems. His writings on these and other diverse topics appear regularly here and across the Internet.