Real Time Object Detection Using YOLO
You Only Look Once (YOLO) is one of the recent breakthroughs in deep learning for object detection and recognition in images and videos. It outperforms several other architectures, such as Region-based CNNs (RCNNs) and Deformable Parts Models (DPMs), in detection speed (RCNNs are more accurate, but slower).
To detect objects in a test image, one might take a trained classifier such as VGGNet or Inception and turn it into an object detector by sliding a small window across the image. This yields many predictions for the image, of which the relevant ones are those the classifier is most certain about. The approach is reasonably accurate, but it is slow, because the classifier runs once per window position. A somewhat more efficient approach is to first predict which parts of the image are likely to contain interesting information (these parts are called region proposals) and then run the classifier only on those regions. The classifier does much less work than in the sliding-window approach, but it still runs multiple times over the same image.
Unlike RCNNs, which perform detection on many region proposals and so make multiple predictions for different regions of the same image, YOLO uses a single CNN both to localize each object and to classify its category. The network therefore runs on the image only once: you only look once!
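To make the cost difference concrete, the sketch below counts how many classifier calls a single-scale sliding-window detector would need, compared with the one forward pass YOLO makes. The function name, window size, and stride are illustrative assumptions, not values from any particular detector.

```python
def sliding_window_positions(image_w, image_h, window, stride):
    """Count the window positions a single-scale sliding-window detector classifies.

    All parameters are in pixels. The window size and stride used in the
    example below are hypothetical values chosen for illustration.
    """
    if window > image_w or window > image_h:
        return 0
    cols = (image_w - window) // stride + 1
    rows = (image_h - window) // stride + 1
    return cols * rows


# A 448x448 image scanned with a 64-pixel window and a 16-pixel stride.
classifier_calls = sliding_window_positions(448, 448, window=64, stride=16)
yolo_forward_passes = 1  # YOLO processes the whole image in one pass

print(classifier_calls, "classifier calls vs", yolo_forward_passes, "YOLO pass")
```

A real sliding-window detector would usually repeat this scan at several scales, so the count above is a lower bound under these assumptions.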
Limits on how vigilantly humans can monitor live video surveillance footage have created demand for AI. Humans watching a single video monitor for more than twenty minutes lose 95% of their ability to maintain the attention needed to discern significant events. AI gives surveillance cameras digital brains to match their eyes, letting them analyze live video and identify objects without a human in the loop.
Inventory management can be tricky because items are hard to track in real time: something is added, removed, or moved every day. An AI system can count and localize objects automatically, which improves inventory accuracy and removes human error from the counting of held and outgoing stock.
The architecture consists of 24 convolutional layers followed by 2 fully connected layers. The initial convolutional layers extract features from the image, and the fully connected layers predict the output probabilities and bounding box coordinates.
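As a rough illustration of what the fully connected layers produce, the sketch below computes the shape of the prediction tensor. The grid size of 7, the 2 boxes per cell, and the 5 values per box are assumptions taken from the original YOLO paper's Pascal VOC configuration; they are not stated in this article. The function name is hypothetical.

```python
def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Return the (rows, cols, depth) shape of a YOLO v1 style prediction tensor.

    Defaults are assumed from the original YOLO paper's Pascal VOC setup.
    """
    values_per_box = 5  # x, y, width, height, confidence
    depth = boxes_per_cell * values_per_box + num_classes
    return (grid_size, grid_size, depth)


print(yolo_output_shape())
```

Under these assumptions every grid cell carries its box coordinates, box confidences, and class probabilities in one vector, which is why a single forward pass is enough.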
Building the Network
The first 20 convolutional layers, followed by an average pooling layer and a fully connected layer, are pre-trained on the 1000-class ImageNet (2012) dataset. This network is trained to a single-crop top-5 accuracy of 88% on the ImageNet (2012) validation set. Four convolutional layers and two fully connected layers, with randomly initialized weights, are then added to the pre-trained network. The bounding box width and height are normalized by the image width and height so that they fall between 0 and 1. A linear activation function is used for the final layer, and leaky ReLU is used for all other layers.
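The box normalization and the leaky ReLU activation described above can be sketched as follows. The function names are hypothetical, and the negative slope of 0.1 is an assumption taken from the original YOLO paper, since this article does not give a value.

```python
def normalize_box_size(box_w, box_h, image_w, image_h):
    """Scale a box's width and height by the image size so both lie in [0, 1]."""
    return box_w / image_w, box_h / image_h


def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: pass positive inputs through, scale the rest by a small slope.

    The 0.1 slope is an assumed value (from the original YOLO paper).
    """
    return x if x > 0 else negative_slope * x


# A 112x224 pixel box inside a 448x448 image.
print(normalize_box_size(112, 224, 448, 448))
print(leaky_relu(3.0), leaky_relu(-2.0))
```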
Training the Network
The model is trained on the Pascal VOC 2007 and 2012 datasets with a batch size of 64, a momentum of 0.9, and a decay of 0.0005. The learning rate is 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs. Dropout and data augmentation are also used. The version of YOLO we are using is trained on the Pascal VOC dataset and can detect 20 different classes of objects. YOLO has been benchmarked at 30 to 200 FPS on a Titan X GPU. In our own tests it runs at around 40-50 FPS, which is reasonable and can be called real time, since roughly 30 FPS is commonly treated as the threshold for smooth, real-time video. It is therefore feasible to integrate it with the live webcam stream we have planned.
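The stepped learning rate schedule above can be written as a small lookup function. This is a minimal sketch assuming zero-based epoch numbering and the 75/30/30 epoch split given in the text; the function name is hypothetical, and any warm-up phase before the first stage is left out.

```python
def learning_rate(epoch):
    """Return the learning rate for a zero-based epoch index.

    Stages follow the text: 1e-2 for 75 epochs, 1e-3 for the next 30,
    and 1e-4 for the final 30 (135 epochs in total).
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < 75:
        return 1e-2
    if epoch < 105:
        return 1e-3
    if epoch < 135:
        return 1e-4
    raise ValueError("schedule is only defined for 135 epochs")


for epoch in (0, 74, 75, 104, 105, 134):
    print(epoch, learning_rate(epoch))
```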
Build this in Big Brain in just 4 steps within 30 minutes: