Tracking Identity in Video
This is the second entry in the Judo and Deep Learning case study. You can find the introduction here, and the previous entry here.
Previously, we were able to apply some basic pixel math to a single cropped image of a detected player to determine whether that Judoka was in a White or Blue Gi. The way we did this was a poor choice in hindsight (and probably in foresight, to be honest), but let's show our work first.
Applying our method to a video capture
We'll need to read in not just the first frame of a video, but each frame in sequence. Luckily, this is very easy in OpenCV.
```python
import cv2
from ultralytics import YOLO

# load the same model and video as before
model = YOLO('yolov8n-pose.pt')
example_file_path = "/Volumes/trainingdata/edited/koshi guruma/13.mp4"

# open a capture stream and step through it frame by frame
cap = cv2.VideoCapture(example_file_path)
while cap.isOpened():
    success, frame = cap.read()
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
    if success:
        results = model(frame)
```
`results` contains a collection of `Result` objects (documentation for that here), which hold our bounding boxes, masks, probabilities, keypoints, etc. Each detection within a `result` is a person that was found (or thought to be found) by the model.
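As a quick illustration, here is roughly what poking at those objects looks like. The attribute names follow the ultralytics Results API; the shapes in the comments are my assumptions for a 17-keypoint pose model.

```python
# rough sketch: inspecting what each Result holds for a single frame
for result in results:
    print(result.boxes.xyxy)     # bounding boxes, one (x1, y1, x2, y2) row per person
    print(result.boxes.conf)     # detection confidence for each person
    print(result.keypoints.xy)   # pose keypoints, roughly (people, 17, 2)
```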
For each `result` in each frame, we can create an annotation of that frame (documentation for that here), and then iterate over each `Box` in the result to find the gi color of the person contained within the bounding box.
```python
for result in results:
    annotator = Annotator(frame)
    for box in result.boxes:
        converted_coords = list(map(int, box.xyxy[0]))
        # for my own debugging, to confirm that the area being checked was correct
        debugShowRectangle(frame, converted_coords)
        player_area = getCroppedPlayerArea(frame, converted_coords)
        grayscale = cv2.cvtColor(player_area, cv2.COLOR_BGR2GRAY)
        gi_color = getGiColor(grayscale)
        print(gi_color.value)
        annotator.box_label(box.xyxy[0], f"{gi_color}")
```
Using the same helper methods (though slightly modified) that we laid out in our previous entry, we'll get a cropped section of the video frame to parse and determine the gi color contained within.
```python
def debugShowRectangle(image, box):
    left, top, right, bottom = box
    cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 3)

def getCroppedPlayerArea(image, player):
    # crop the frame down to just the detected player's bounding box
    return image[player[1]:player[3], player[0]:player[2]]

def getGiColor(grayscale_image):
    print("values >= 127: ")
    print(np.sum(grayscale_image >= 127))
    print("values <= 127: ")
    print(np.sum(grayscale_image <= 127))
    print("total values: ")
    print(np.sum(grayscale_image))
    return GI_COLOR.WHITE if (np.sum(grayscale_image >= 127) > np.sum(grayscale_image <= 127)) else GI_COLOR.BLUE
```
However, when this code is run over the entire video, here is the result.
Clearly we have an issue with how we are parsing who is who.
Critiquing the results
What are my other options?
What we have here is a series of unknown unknowns. When I first tackled this, I had no idea how to create custom datasets or classes of objects. What I need to do is this:
- Create an annotated dataset that contains two class labels: `blue` and `white`. I can use a handful of videos and images I have already made to create this dataset, and export the keypoint annotations into the format that YOLOv8 expects to build labelled training and validation sets.
- Train a model on the custom dataset, and test whether our results are any different (a rough sketch of that training step follows this list).
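To make that second step concrete, here is a minimal sketch of what the training call could look like. The dataset config name `gi_pose.yaml` is a placeholder I made up, and the epoch count and image size are guesses rather than tuned values.

```python
from ultralytics import YOLO

# start from the pretrained pose weights and fine-tune on the custom dataset
model = YOLO('yolov8n-pose.pt')

# 'gi_pose.yaml' is a hypothetical dataset config describing the blue/white
# classes, the image/label paths, and the keypoint shape
model.train(data='gi_pose.yaml', epochs=100, imgsz=640)

# check the fine-tuned weights against the validation split
metrics = model.val()
```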
Creating an annotated dataset
Where do we start? And more to the point, how do we even do this?
In my research there are a handful of open-source annotation tools that I can use to create keypoints for pose detection. All that really matters is that the keypoints are in the proper order. YOLOv8 has a specific order that it expects all keypoint data to adhere to:
- Nose
- Left-eye
- Right-eye
- Left-ear
- Right-ear
- Left-shoulder
- Right-shoulder
- Left-elbow
- Right-elbow
- Left-wrist
- Right-wrist
- Left-hip
- Right-hip
- Left-knee
- Right-knee
- Left-ankle
- Right-ankle
As long as the keypoint data is in that order, we should be able to use any labelling tool we like. We can also use either 2D (x, y) or 3D (x, y, visible) tuple formatting for the keypoints themselves.
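For reference, my understanding from the ultralytics docs is that each image gets a text label file with one row per person: the class index, a normalized bounding box, and then the seventeen keypoints in the order above. With the 3D (x, y, visible) format, a row is laid out like this (this is a template, not real data):

```
<class-index> <x-center> <y-center> <width> <height> <x1> <y1> <v1> <x2> <y2> <v2> ... <x17> <y17> <v17>
```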
For my purposes, I think I will go with CVAT for labelling. I can create the pose estimation annotations there and convert the JSON output without too much trouble. This will be the majority of the following entry in this case study.
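As a rough preview of that conversion, the sketch below assumes CVAT's COCO Keypoints export; the field names come from the COCO format, while the paths and the category-to-class mapping are placeholders I will need to revisit.

```python
import json
from pathlib import Path

# rough sketch: turn a COCO-keypoints JSON export into YOLO pose label files
# (paths and the category-to-class mapping are placeholders)
def coco_to_yolo_labels(coco_json_path, output_dir):
    coco = json.loads(Path(coco_json_path).read_text())
    images = {img["id"]: img for img in coco["images"]}
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for ann in coco["annotations"]:
        img = images[ann["image_id"]]
        w, h = img["width"], img["height"]

        # COCO bboxes are [x, y, width, height] in pixels; YOLO wants a
        # normalized [x-center, y-center, width, height]
        bx, by, bw, bh = ann["bbox"]
        row = [ann["category_id"] - 1,  # assuming categories are 1-indexed blue/white
               (bx + bw / 2) / w, (by + bh / 2) / h, bw / w, bh / h]

        # COCO keypoints are a flat [x1, y1, v1, x2, y2, v2, ...] list in pixels
        kpts = ann["keypoints"]
        for x, y, v in zip(kpts[0::3], kpts[1::3], kpts[2::3]):
            row += [x / w, y / h, v]

        # append so multiple people in the same image share one label file
        label_path = Path(output_dir) / (Path(img["file_name"]).stem + ".txt")
        with label_path.open("a") as f:
            f.write(" ".join(f"{value:.6g}" for value in row) + "\n")
```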