Object Detection from a video with YOLOv3 in Python

In recent years, deep learning approaches have achieved state-of-the-art results for object recognition. YOLO (You Only Look Once) predicts bounding box coordinates and the class probabilities for those boxes in a single pass over the full image.

YOLO is a real-time object detection technique built on a neural network. Its greatest benefit is speed: because the whole image is processed in one pass, it is fast enough to be used on video.

In this tutorial, we’ll learn how to use a pre-trained YOLOv3 model for object detection on a video in Python with OpenCV.

So, let’s get this tutorial started…

First and foremost, make sure that you have imported all of the necessary Python packages.

import numpy as np
import cv2

Specify the detection confidence and non-max suppression thresholds, and then set the paths to the YOLO weights and model configuration files.

confidenceThreshold = 0.5
NMSThreshold = 0.3
modelConfiguration = 'cfg/yolov3.cfg'
modelWeights = 'yolov3.weights'

Load the COCO class labels that our YOLO model was trained on, and initialize a palette of random colors, one per class label.

labelsPath = 'coco.names'
labels = open(labelsPath).read().strip().split('\n')
COLORS = np.random.randint(0, 255, size=(len(labels), 3), dtype="uint8")
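
As a small illustration of how the labels and the color palette are used later on, the sketch below builds both from a hypothetical three-class label string (standing in for the contents of coco.names) and looks up the label and color for one class ID. The fixed random seed is an addition, not part of the original code, so that colors stay the same between runs.

```python
import numpy as np

# Hypothetical stand-in for the contents of coco.names (one label per line)
labelsText = "person\nbicycle\ncar\n"
labels = labelsText.strip().split('\n')

# Seeding is optional; it keeps each class color stable between runs
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(labels), 3), dtype="uint8")

classID = 2  # index the network would predict for one detection
label = labels[classID]
# OpenCV drawing functions expect plain Python ints, not NumPy uint8 values
color = [int(c) for c in COLORS[classID]]
print(label, color)
```

The same lookup, `labels[classIDs[i]]` and `COLORS[classIDs[i]]`, is what the drawing code relies on further down.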

Load the YOLO network and collect only the names of the output layers that we need. Then open the video on which we want to perform object detection.

net = cv2.dnn.readNetFromDarknet(modelConfiguration, modelWeights)

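layerNames = net.getLayerNames()
# getUnconnectedOutLayers() returns 1-based indices; flatten() handles both the
# Nx1 array of older OpenCV versions and the 1-D array of newer ones
outputLayer = [layerNames[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]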

video = cv2.VideoCapture('video.mp4')
writer = None
(W, H) = (None, None)

# Try to determine the total number of frames in the video
try:
    prop = cv2.CAP_PROP_FRAME_COUNT
    total = int(video.get(prop))
    print("[INFO] {} total frames in video".format(total))
except Exception:
    print("Could not determine no. of frames in video")
    total = -1

count = 0

Now read the video frame by frame. For each frame, construct an input blob, set it as the network’s input, then run a forward pass and collect the predictions from the output layers.

Loop over all of the detections and extract each detection’s class ID and confidence. Weak predictions are filtered out by keeping only detections whose confidence exceeds the minimum threshold.

YOLO reports each box by its center (x, y) coordinates, width, and height, so use the center to derive the top-left corner of the bounding box. Then append the bounding box coordinates, confidence, and class ID to their respective lists.
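
To make the blob step more concrete, here is a rough NumPy-only sketch of the transformations that `cv2.dnn.blobFromImage` applies with the arguments used in this tutorial: scaling pixel values by 1/255, swapping BGR to RGB, and reordering the data to a (batch, channels, height, width) layout. It assumes the frame is already 416x416, so the resizing step is skipped, and `toBlob` is a hypothetical helper, not an OpenCV function.

```python
import numpy as np

def toBlob(frame, scale=1 / 255.0):
    """Approximate blobFromImage(frame, scale, swapRB=True) for a frame
    that already has the target size (no resizing is done here)."""
    rgb = frame[:, :, ::-1]                  # BGR -> RGB (swapRB=True)
    scaled = rgb.astype(np.float32) * scale  # pixel values into [0, 1]
    chw = scaled.transpose(2, 0, 1)          # HWC -> CHW
    return chw[np.newaxis, ...]              # add batch dimension -> NCHW

# Dummy 416x416 BGR frame filled with pure blue (B=255, G=0, R=0)
frame = np.zeros((416, 416, 3), dtype=np.uint8)
frame[:, :, 0] = 255

blob = toBlob(frame)
print(blob.shape)  # (1, 3, 416, 416)
```

In the actual loop, OpenCV also resizes every frame to 416x416, which is why the detections come back in relative coordinates and are scaled by the original frame width and height afterwards.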

while True:
    (ret, frame) = video.read()
    # Stop when no more frames can be read
    if not ret:
        break
    if W is None or H is None:
        (H, W) = frame.shape[:2]

    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    # Feed the blob to the network, then run the forward pass
    net.setInput(blob)
    layersOutputs = net.forward(outputLayer)

    boxes = []
    confidences = []
    classIDs = []

    for output in layersOutputs:
        for detection in output:
            scores = detection[5:]
            classID = np.argmax(scores)
            confidence = scores[classID]
            if confidence > confidenceThreshold:
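                # Scale the relative box to pixel coordinates
                box = detection[0:4] * np.array([W, H, W, H])
                (centerX, centerY, width, height) = box.astype('int')
                # Derive the top-left corner from the center
                x = int(centerX - (width / 2))
                y = int(centerY - (height / 2))

                boxes.append([x, y, int(width), int(height)])
                confidences.append(float(confidence))
                classIDs.append(classID)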
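
The box decoding above can be checked in isolation. The sketch below applies the same arithmetic to a single hand-made detection vector. The numbers and the `decodeDetection` helper are made up for illustration, and the layout is assumed to be the usual YOLOv3 one: 4 box values, 1 objectness score, then one score per class.

```python
import numpy as np

def decodeDetection(detection, W, H, confidenceThreshold=0.5):
    """Return ([x, y, w, h], confidence, classID), or None if the
    best class score does not exceed the threshold."""
    scores = detection[5:]            # per-class scores
    classID = int(np.argmax(scores))
    confidence = float(scores[classID])
    if confidence <= confidenceThreshold:
        return None
    # Box values are relative to the image size; scale to pixels
    box = detection[0:4] * np.array([W, H, W, H])
    (centerX, centerY, width, height) = box.astype('int')
    x = int(centerX - (width / 2))    # left edge
    y = int(centerY - (height / 2))   # top edge
    return [x, y, int(width), int(height)], confidence, classID

# Made-up detection: box centered in the frame, 3 classes
detection = np.array([0.5, 0.5, 0.25, 0.5, 0.9, 0.1, 0.8, 0.05])
result = decodeDetection(detection, W=640, H=480)
print(result)  # ([240, 120, 160, 240], 0.8, 1)
```

A box centered at (320, 240) with size 160x240 ends up with its top-left corner at (240, 120), which is the form `cv2.rectangle` needs.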

To suppress weak, overlapping bounding boxes, apply non-maxima suppression. If at least one detection remains, loop over the kept indexes, draw each bounding box rectangle on the frame, and label it with the class name and confidence.

    detectionNMS = cv2.dnn.NMSBoxes(boxes, confidences, confidenceThreshold, NMSThreshold)
    if(len(detectionNMS) > 0):
        for i in detectionNMS.flatten():
            (x, y) = (boxes[i][0], boxes[i][1])
            (w, h) = (boxes[i][2], boxes[i][3])

            color = [int(c) for c in COLORS[classIDs[i]]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            text = '{}: {:.4f}'.format(labels[classIDs[i]], confidences[i])
            cv2.putText(frame, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

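    # Create the writer once, outside the detection loop, so every frame is saved
    if writer is None:
        fourcc = cv2.VideoWriter_fourcc(*'MJPG')
        writer = cv2.VideoWriter('output.avi', fourcc, 30, (frame.shape[1], frame.shape[0]), True)
    print("Writing frame", count + 1)
    writer.write(frame)
    count = count + 1

# Release the video handles once all frames are processed
if writer is not None:
    writer.release()
video.release()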
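
For intuition about what `cv2.dnn.NMSBoxes` is doing, here is a minimal greedy non-maxima suppression sketch in plain NumPy. It is not OpenCV's implementation and may differ from it in details. `iou` and `simpleNMS` are hypothetical helper names, and boxes use the same [x, y, w, h] format as in the code above.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x, y, w, h] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def simpleNMS(boxes, confidences, scoreThreshold, nmsThreshold):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap a kept box by more than nmsThreshold, repeat."""
    order = [i for i in np.argsort(confidences)[::-1]
             if confidences[i] > scoreThreshold]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= nmsThreshold for k in keep):
            keep.append(int(i))
    return keep

# Two heavily overlapping boxes plus one separate box
boxes = [[100, 100, 50, 50], [105, 105, 50, 50], [300, 300, 40, 40]]
confidences = [0.9, 0.6, 0.8]
kept = simpleNMS(boxes, confidences, 0.5, 0.3)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, which is above the 0.3 threshold, so only the higher-scoring of the two survives. Lowering `NMSThreshold` makes the suppression more aggressive.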

It’s all done now, hurrah! Let’s have a look at the results.


So, in this tutorial, we learned how to use YOLOv3 to identify objects successfully. I hope you all enjoyed this tutorial.
