Today I tried to implement AI-based hand detection in pure C++. At first I planned to use MediaPipe, assuming there would be a C++ MediaPipe library, but I soon found that only the Python library is readily usable.
This was a bit of a crisis: the project deadline is in two days, and we are still missing this crucial hand-detection feature.
And then I got the idea!
I can just use shared memory. The main program (C++) creates a process/thread that executes a Python agent; the agent constantly detects hand landmarks from the given video device and sends the information, along with the video frame, through shared memory. Simple!
All we need is a simple Python program that reads video frames and extracts the hand landmark coordinates using MediaPipe.
The shared memory system in Linux works by keys. When a shared memory segment is created by one process, it gets a key that other processes can use to access it. This key can be any integer value; as long as every process sharing the memory knows the key, it can be anything.
In C/C++, the sys/shm.h header provides a number of shared-memory functions, such as shmget(), shmctl() and shmat(). In Python, the sysv_ipc module provides access to System V shared memory, which is essentially the same mechanism as sys/shm.h in C/C++. All we need to do is create a shared memory segment and exchange data through it in some agreed-upon structure.
The communication process is as follows:
It's quite simple. There's no need for complicated synchronization here, since there is exactly one sender and one receiver (strictly speaking, the reader can still observe a partially written frame without a lock, but an occasional torn frame is harmless for this use case). I also added a command-line argument that tells the Python agent which video device to capture from; the C++ main program passes the device along when it executes the agent.
Agent source :
import cv2
import mediapipe as mp
import sysv_ipc
import argparse
from struct import pack

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.holistic
hands = mp_hands.Holistic()

parser = argparse.ArgumentParser()
parser.add_argument('--video', help="Select the video device (default: /dev/video0)", default="/dev/video0")
parser.add_argument('--debug', help="Run in debug mode (shows a preview window)", default="0")
args = parser.parse_args()

def main():
    debug = args.debug == '1'
    cap = cv2.VideoCapture(args.video)
    video_width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    video_height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    if video_width == 0 or video_height == 0:
        print("Failed reading the video device!")
        return
    width = 800
    height = int(video_height * (width / video_width))
    print(f"width = {width}\nheight = {height}")

    # attach to the segments created by the C++ main program
    memory_hand_info = sysv_ipc.SharedMemory(3141591)
    memory_img = sysv_ipc.SharedMemory(3141592)
    # send the width, height and channel information
    memory_hand_info.write(pack("iii", width, height, 3))

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            print("Cannot receive frame!")
            break
        frame = cv2.resize(frame, (width, height))
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        left_hand_landmarks_coords_x = [0] * 21
        left_hand_landmarks_coords_y = [0] * 21
        right_hand_landmarks_coords_x = [0] * 21
        right_hand_landmarks_coords_y = [0] * 21
        if results.left_hand_landmarks:
            for i, l in enumerate(results.left_hand_landmarks.landmark):
                # landmark coordinates are normalized; convert to pixels
                left_hand_landmarks_coords_x[i] = int(l.x * frame.shape[1])
                left_hand_landmarks_coords_y[i] = int(l.y * frame.shape[0])
        if results.right_hand_landmarks:
            for i, l in enumerate(results.right_hand_landmarks.landmark):
                right_hand_landmarks_coords_x[i] = int(l.x * frame.shape[1])
                right_hand_landmarks_coords_y[i] = int(l.y * frame.shape[0])

        if debug:
            cv2.imshow("camera", frame)
        # write the frame to the shared memory
        memory_img.write(frame)
        # write the landmark information to the shared memory
        hand_info_struct = pack("iii", width, height, 3)
        hand_info_struct += pack("ii", int(results.left_hand_landmarks is not None),
                                 int(results.right_hand_landmarks is not None))
        hand_info_struct += pack("21i", *left_hand_landmarks_coords_x)
        hand_info_struct += pack("21i", *left_hand_landmarks_coords_y)
        hand_info_struct += pack("21i", *right_hand_landmarks_coords_x)
        hand_info_struct += pack("21i", *right_hand_landmarks_coords_y)
        memory_hand_info.write(hand_info_struct)

        if debug and cv2.waitKey(1) == 27:  # ESC exits in debug mode
            break
    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    main()
Although the source code looks quite complicated, it is actually not. The agent simply reads a frame with OpenCV, feeds it into the AI model, gathers the landmark coordinates, and finally sends everything to shared memory: the frame, its width, height and channel count, and all the landmark coordinates.
Now I just have to write a C++ program that executes the agent and reads the frames from the shared memory.
void *hand_detection::ai_agent(void *data) {
    std::string str = "python3 hand_detection_python.py --debug 0 --video ";
    str += std::string(GlobalHandDetectionSystem::get_self()->video_device);
    std::system(str.c_str());
    // signal the main thread that the agent has ended
    ((bool *)data)[0] = false;
    pthread_exit(0x00);
}
GlobalHandDetectionSystem structure :
struct GlobalHandDetectionSystem {
    // singleton design
    static GlobalHandDetectionSystem *get_self(void) {
        static GlobalHandDetectionSystem *p = 0x00;
        if(!p) p = new GlobalHandDetectionSystem;
        return p;
    }
    char video_device[64];
    pthread_t agent_thread;
    int agent_thread_id;
    bool thread_running = false;
    int screen_width;
    int screen_height;
    int screen_channels;
    void *shm_addr_imgmem = 0;  // address of the SHM holding the image
    void *shm_addr_infomem = 0; // address of the SHM holding the info structure
};
Structure for communicating the hand landmarks :
typedef struct {
    int width;
    int height;
    int channels;
    int left_hand_visible;
    int right_hand_visible;
    int left_hand_landmarks_xlist[21];
    int left_hand_landmarks_ylist[21];
    int right_hand_landmarks_xlist[21];
    int right_hand_landmarks_ylist[21];
} hands_info_t;
This is the result! (Yes, I took this video tomorrow. Who cares?)
This is completely unrelated, but the sheet music on the piano is the Tchaikovsky-Pletnev Concert Suite from The Nutcracker, No. 7, Pas de deux. It's a really good piece, and I highly encourage you to listen to it!