Tracking Identity in Video
This is the second entry in the Judo and Deep Learning case study. You can find the introduction here, and the previous entry here.
Previously, we were able to apply some basic pixel math to a single cropped image of a detected player to determine whether that Judoka was in a White or Blue Gi. The way we did this was a poor choice in hindsight (and probably in foresight, to be honest), but let's show our work first.
Applying our method to a video capture
We'll need to read in not just the first frame of a video, but each frame in sequence. Luckily, this is very easy in OpenCV.
```python
import cv2
from ultralytics import YOLO

# load the same model and video as before
model = YOLO('yolov8n-pose.pt')
example_file_path = "/Volumes/trainingdata/edited/koshi guruma/13.mp4"

# open a capture stream and step through it frame by frame
cap = cv2.VideoCapture(example_file_path)
while cap.isOpened():
    success, frame = cap.read()
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
    if success:
        results = model(frame)
```
`results` contains a collection of `Result` objects (documentation for that here), which hold our bounding boxes, masks, probabilities, keypoints, etc. Each detection within a `result` is a person that was found (or thought to be found) by the model.
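As a quick illustration, here is roughly what poking at those objects looks like. The attribute names follow the ultralytics Results API; the shapes in the comments are my assumptions for a 17-keypoint pose model.

```python
# rough sketch: inspecting what each Result holds for a single frame
for result in results:
    print(result.boxes.xyxy)     # bounding boxes, one (x1, y1, x2, y2) row per person
    print(result.boxes.conf)     # detection confidence for each person
    print(result.keypoints.xy)   # pose keypoints, roughly (people, 17, 2)
```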
For each `result` in each frame, we can create an annotation of that frame (documentation for that here), and then iterate over each `Box` in the result to find the gi color of the person contained within the bounding box.
```python
for result in results:
    annotator = Annotator(frame)
    for box in result.boxes:
        converted_coords = list(map(int, box.xyxy[0]))
        # for my own debugging, to confirm that the area being checked was correct
        debugShowRectangle(frame, converted_coords)
        player_area = getCroppedPlayerArea(frame, converted_coords)
        grayscale = cv2.cvtColor(player_area, cv2.COLOR_BGR2GRAY)
        gi_color = getGiColor(grayscale)
        print(gi_color.value)
        annotator.box_label(box.xyxy[0], f"{gi_color}")
```
Using the same helper methods (though slightly modified) that we laid out in our previous entry, we'll get a cropped section of the video frame to parse and determine the gi color contained within.
```python
def debugShowRectangle(image, box):
    left, top, right, bottom = box
    cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 3)

def getCroppedPlayerArea(image, player):
    # crop the frame down to just the detected player's bounding box
    return image[player[1]:player[3], player[0]:player[2]]

def getGiColor(grayscale_image):
    print("values >= 127: ")
    print(np.sum(grayscale_image >= 127))
    print("values <= 127: ")
    print(np.sum(grayscale_image <= 127))
    print("total values: ")
    print(np.sum(grayscale_image))
    return GI_COLOR.WHITE if (np.sum(grayscale_image >= 127) > np.sum(grayscale_image <= 127)) else GI_COLOR.BLUE
```
However, when this code is run over the entire video, here is the result.
Clearly we have an issue with how we are parsing who is who.
Critiquing the results
What are my other options?
What we have here is a series of unknown unknowns. When I first tackled this, I had no idea how to create custom datasets or classes of objects. What I need to do is this:
- Create an annotated dataset that contains two class labels: `blue` and `white`. I can use a handful of videos and images I have already made to create this dataset, and export the keypoint annotations into the format that YOLOv8 expects to build labelled training and validation sets.
- Train a model on the custom dataset, and test whether our results are any different (a rough sketch of that training step follows this list).
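To make that second step concrete, here is a minimal sketch of what the training call could look like. The dataset config name `gi_pose.yaml` is a placeholder I made up, and the epoch count and image size are guesses rather than tuned values.

```python
from ultralytics import YOLO

# start from the pretrained pose weights and fine-tune on the custom dataset
model = YOLO('yolov8n-pose.pt')

# 'gi_pose.yaml' is a hypothetical dataset config describing the blue/white
# classes, the image/label paths, and the keypoint shape
model.train(data='gi_pose.yaml', epochs=100, imgsz=640)

# check the fine-tuned weights against the validation split
metrics = model.val()
```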
Creating an annotated dataset
Where do we start? And more to the point, how do we even do this?
In my research there are a handful of open-source annotation tools that I can use to create keypoints for pose detection. All that really matters is that the keypoints are in the proper order. YOLOv8 has a specific order that it expects all keypoint data to adhere to:
- Nose
- Left-eye
- Right-eye
- Left-ear
- Right-ear
- Left-shoulder
- Right-shoulder
- Left-elbow
- Right-elbow
- Left-wrist
- Right-wrist
- Left-hip
- Right-hip
- Left-knee
- Right-knee
- Left-ankle
- Right-ankle
As long as the keypoint data is in that order, we should be able to use any labelling tool we like. We can also use either 2D (x, y) or 3D (x, y, visible) tuple formatting for the keypoints themselves.
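For reference, my understanding from the ultralytics docs is that each image gets a text label file with one row per person: the class index, a normalized bounding box, and then the seventeen keypoints in the order above. With the 3D (x, y, visible) format, a row is laid out like this (this is a template, not real data):

```
<class-index> <x-center> <y-center> <width> <height> <x1> <y1> <v1> <x2> <y2> <v2> ... <x17> <y17> <v17>
```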
For my purposes, I think I will go with CVAT for labelling. I can create the pose estimation annotations there and convert the JSON output without too much trouble. This will be the majority of the following entry in this case study.
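As a rough preview of that conversion, the sketch below assumes CVAT's COCO Keypoints export; the field names come from the COCO format, while the paths and the category-to-class mapping are placeholders I will need to revisit.

```python
import json
from pathlib import Path

# rough sketch: turn a COCO-keypoints JSON export into YOLO pose label files
# (paths and the category-to-class mapping are placeholders)
def coco_to_yolo_labels(coco_json_path, output_dir):
    coco = json.loads(Path(coco_json_path).read_text())
    images = {img["id"]: img for img in coco["images"]}
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for ann in coco["annotations"]:
        img = images[ann["image_id"]]
        w, h = img["width"], img["height"]

        # COCO bboxes are [x, y, width, height] in pixels; YOLO wants a
        # normalized [x-center, y-center, width, height]
        bx, by, bw, bh = ann["bbox"]
        row = [ann["category_id"] - 1,  # assuming categories are 1-indexed blue/white
               (bx + bw / 2) / w, (by + bh / 2) / h, bw / w, bh / h]

        # COCO keypoints are a flat [x1, y1, v1, x2, y2, v2, ...] list in pixels
        kpts = ann["keypoints"]
        for x, y, v in zip(kpts[0::3], kpts[1::3], kpts[2::3]):
            row += [x / w, y / h, v]

        # append so multiple people in the same image share one label file
        label_path = Path(output_dir) / (Path(img["file_name"]).stem + ".txt")
        with label_path.open("a") as f:
            f.write(" ".join(f"{value:.6g}" for value in row) + "\n")
```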