Hand gesture recognition via model fitting in energy minimization w/OpenCV

(Image: hands with model fitted)
Hi,
Just wanted to share a thing I made: a simple 2D hand pose estimator that fits a skeleton model to the hand. There has been a crap load of work on hand pose estimation, but I was inspired by this ancient work. The problem is that when you set out to find a good solution, everything is very hard to understand and implement. In such cases I like to be inspired by a method and just set out with my own implementation. This way I understand what’s going on, simplify it, and share it with you!
Anyway, let’s get down to business.
Edit (6/5/2014): Also see some of my other work on hand gesture recognition using smart contours and particle filters

A bit about energy minimization problems

A dear friend revealed to me the wonders of energy minimization problems a while back, and ever since I have been trying to find uses for that method. Basically, it is trying to find a minimum (ideally the global one) of a complicated energy function (usually with many parameters) by following the function’s gradient downhill. Such methods are often called Gradient Descent, and are used mostly for non-linear systems that can’t be solved easily using a least-squares variant.
A lot of work in computer vision was done using energy functions (I believe the most seminal was Snakes, over 10,000 citations), usually having two terms: Internal energy and External energy. The equilibrium between the two terms should result in a low-energy system – our optimal result. So we would like to formulate the terms in our system such that when they are 0 – they describe the system as we want it.
Following the works with active contours, I believe the external energy function should have to do with how the hand model fits to the hand blob, and the internal energy will have to do with how “comfortable” the hand is with this configuration.

The hand model

Let’s see what a 2D model of a hand might look like:

Kinda looks like a rake… huh?
There are some parts that practically can’t change much, i.e. the palm (orange), and some that might change drastically, i.e. the fingers (red). Each finger has joints (blue circles) and a tip (bigger blue circle).

typedef struct finger_data {
	Point2d origin_offset;		//base of finger relative to center of hand
	double a;					//angle
	vector<double> joints_a;	//angles of joints
	vector<double> joints_d;	//bone lengths
} FINGER_DATA;

typedef struct hand_data {
	FINGER_DATA fingers[5];		//fingers
	double a;					//angle of whole hand
	Point2d origin;				//center of palm
	Point2d origin_offset;		//offset from center for optimization
	double size;				//relative size of hand = length of a finger
} HAND_DATA;

At first I thought, since I’m only interested in the tips of the fingers, to use Inverse Kinematics to guide the tips to a certain point and let the joints find their own minimal energy position, following this article. But I abandoned this method because of complications.
I also had to simplify this model, both for real-time estimation and for better results. So in the end I ended up with a very rigid model that allows only one joint per finger and no angular movement.
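For reference, this is roughly how joint positions can be computed from a finger's angles and bone lengths. It is a sketch under my own assumptions (each joint angle is added to the running direction, and `Pt` stands in for `Point2d`); the actual `newTip` function used later is not shown in this post and may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };   // stand-in for cv::Point2d

// Walk along the finger: each joint adds its own angle to the running
// direction, then we advance by the bone length. Returns every joint
// position; the last one is the fingertip.
std::vector<Pt> fingerJoints(Pt base, double fingerAngle,
                             const std::vector<double>& jointAngles,
                             const std::vector<double>& boneLengths) {
    std::vector<Pt> joints;
    double angle = fingerAngle;
    Pt p = base;
    for (std::size_t i = 0; i < boneLengths.size(); ++i) {
        if (i < jointAngles.size()) angle += jointAngles[i];
        p.x += boneLengths[i] * std::cos(angle);
        p.y += boneLengths[i] * std::sin(angle);
        joints.push_back(p);
    }
    return joints;
}
```

With all joint angles at zero the finger is a straight line, which is essentially the rigid model described above: only the bone lengths are left to change.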

Using tnc.c

tnc.c is a “library”, essentially one C file, that implements a line search algorithm that is able to find a minimum point of a multivariate function. I’m not certain of the algorithm details, and it’s not so important, as it can be replaced with any other similar library. But tnc.c has a great advantage – it is dead simple. One function starts the gradient descent, calling back a function of yours to calculate the energy and its gradients.
So basically I had to write just one very short function:

static int my_f(double x[], double *f, double g[], void *state) {
	DATA_FOR_TNC* d_ptr = (DATA_FOR_TNC*)state;
	DATA_FOR_TNC new_data = *d_ptr;
	mapVecToData(x, new_data.hand); //evaluate the energy at x, not at the stored state
	*f = calc_Energy(new_data,*d_ptr);
	//calc gradients
	double _x[SIZE_OF_HAND_DATA];
	for(int i=0;i<SIZE_OF_HAND_DATA;i++) {
		memcpy(_x, x, sizeof(double)*SIZE_OF_HAND_DATA); //reset variables
		_x[i] = _x[i] + EPSILON; //change only one variable
		mapVecToData(_x, new_data.hand);
		double E_epsilon = calc_Energy(new_data,*d_ptr);
		g[i] = ((E_epsilon - *f) / EPSILON); //calc the gradient for this variable change
	}
	return 0;
}

tnc.c calls this function on every iteration of the search. double x[] holds the variables the search is currently examining, double* f receives the energy for this state, double g[] receives the gradients (same size as x[]), and void* state is a user-defined pointer that is carried along through the process.
So what I did is simply change the value of each parameter in turn, to test how it affects the energy in the system. I measure the energy with one parameter nudged by EPSILON, subtract the energy of the “natural” setup (without any changes to the parameters), divide by EPSILON, and I get the gradient for this parameter.
The energy function came out a bit different in the end:

static double calc_Energy(DATA_FOR_TNC& d, DATA_FOR_TNC& orig_d) {
	double _sum = 0.0;
	//external energy: how close are the joints to the hand blob? (how well do they fit to it)
	Mat tips(5,1,CV_64FC2);
	for (int j=0; j<5; j++) {
		FINGER_DATA f = d.hand.fingers[j];
		vector<Point2d> tmp; //joints of this finger
		Point2d _newTip = newTip(f,d.hand,tmp); //get joints for this finger
		for (int i=0; i<tmp.size(); i++) { //for each joint find how far it is from the blob
			double ds = pointPolygonTest(d.contour, tmp[i]+getHandOrigin(d.hand), true);
			ds += 5;
			ds = 1 * ((ds < 0) ? -1 : 1) * (ds*ds) ;
			_sum -= (ds > 0) ? 0 : 100*ds;
		}
		tips.at<Point2d>(j,0) = _newTip;
	}
	//laziness of fingers - joints should strive to be as they were in the natural pose
	vector<double> _angles;
//	for (int j=0; j<5; j++) {
//		FINGER_DATA f = d.hand.fingers[j];
//		FINGER_DATA of = orig_d.hand.fingers[j];
////		_angles.push_back(f.a - of.a);
//		for (int i=0; i<f.joints_d.size(); i++) {
////			_angles.push_back(f.joints_a[i] - of.joints_a[i]);
//			_angles.push_back(f.joints_d[i] - of.joints_d[i]);
//		}
//	}
	_angles.push_back(d.hand.a-orig_d.hand.a); //the angle of the hand should be as it was before
	_sum  += 10000*norm(Mat(_angles));
	if(_sum < 0) return 0;
	return _sum;
}

You’ll notice the commented out section. The “laziness of fingers” turned out not to give good results… A different metric is needed! I have not found it yet, maybe you have a good idea?
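As for the external term, the per-joint arithmetic in calc_Energy can be pulled out into a tiny function to see what it does. This is just a restatement of those three lines; `jointPenalty` is a name made up for this sketch.

```cpp
// Penalty for one joint, given its signed distance to the hand contour
// (positive inside the blob, negative outside, which is what
// pointPolygonTest reports when measureDist is true).
double jointPenalty(double signedDist) {
    double ds = signedDist + 5.0;                 // 5 pixel tolerance band
    ds = ((ds < 0) ? -1.0 : 1.0) * (ds * ds);     // signed square
    return (ds > 0) ? 0.0 : -100.0 * ds;          // only "outside" costs
}
```

A joint 10 pixels inside the blob costs nothing, a joint 3 pixels outside is still within the tolerance band and also costs nothing, and a joint 10 pixels outside costs 100 * 5 * 5 = 2500. So the term is flat everywhere inside the blob, which means the gradient gives the optimizer no direction until a joint actually leaves it.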
Starting tnc.c is very simple: Allocating the vectors for X and gradients, initializing the model from the blob, and calling the simple_tnc convenience method. simple_tnc starts tnc with some default parameters that don’t affect the outcome (at least in my tries).

void estimateHand(Mat& mymask) {
	double _x[SIZE_OF_HAND_DATA] = {0};
	Mat X(1,SIZE_OF_HAND_DATA,CV_64FC1,_x);
	double f;
	Mat gradients(Size(SIZE_OF_HAND_DATA,1),CV_64FC1,Scalar(0));
	initialize_hand_data(d, mymask); //d is the DATA_FOR_TNC instance, declared outside this function
	mapDataToVec((double*)X.data, d.hand);
	simple_tnc(SIZE_OF_HAND_DATA, (double*)X.data, &f, (double*)gradients.data, my_f, (void*)&d, 1, 0);
	mapVecToData((double*)X.data, d.hand);
	d.hand.origin = getHandOrigin(d.hand); //move to new position
}
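
The mapDataToVec / mapVecToData pair is what connects the hand structure to the flat array of doubles that the optimizer works on. Their bodies are not shown in this post, so here is a sketch of the idea with a reduced, hypothetical hand (`MiniHand`, `packHand` and `unpackHand` are made-up names, and the layout is an assumption):

```cpp
#include <cstddef>

// Hypothetical, reduced hand: NOT the hand_data struct from this post.
struct MiniHand {
    double angle;              // rotation of the whole hand
    double offsetX, offsetY;   // offset from the palm center
    double fingerLen[5];       // one bone length per finger
};

const std::size_t MINI_HAND_SIZE = 8;   // 1 angle + 2 offsets + 5 lengths

// Flatten the model into the vector the optimizer will modify.
void packHand(const MiniHand& h, double x[]) {
    x[0] = h.angle;
    x[1] = h.offsetX;
    x[2] = h.offsetY;
    for (std::size_t i = 0; i < 5; ++i) x[3 + i] = h.fingerLen[i];
}

// Read the optimizer's vector back into the model, in the same order.
void unpackHand(const double x[], MiniHand& h) {
    h.angle = x[0];
    h.offsetX = x[1];
    h.offsetY = x[2];
    for (std::size_t i = 0; i < 5; ++i) h.fingerLen[i] = x[3 + i];
}
```

The only real requirement is that both functions agree on the order and that the array size matches the number of packed values. Anything left out of the vector stays fixed during the search.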

Results and Discussion

Here are my results so far:

It’s not perfect, but it’s a start. Tracking and estimating an open hand works pretty well, even with some orientation change. But when the fingers are closed… that’s where the problems start.
Sometimes the joints “hover” over the black area to “land” in a white area so they “fit”, but they should not do that. One easy thing to do to counter this is to measure the distance of the whole bone, and not just the joint.
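Here is a rough sketch of that whole-bone idea: sample a few points along the bone and penalize every sample that falls outside the blob. I have not plugged this into the project; `bonePenalty`, the sample count and the `twoDiscs` test blob are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>

struct Pt { double x, y; };   // stand-in for cv::Point2d

// Sum a penalty over points sampled evenly along the bone from a to b.
// signedDist must be positive inside the blob and negative outside.
double bonePenalty(Pt a, Pt b, int samples,
                   const std::function<double(Pt)>& signedDist) {
    double sum = 0.0;
    for (int i = 0; i <= samples; ++i) {
        const double t = (samples > 0) ? static_cast<double>(i) / samples : 0.0;
        const Pt p{ a.x + t * (b.x - a.x), a.y + t * (b.y - a.y) };
        const double d = signedDist(p);
        if (d < 0) sum += d * d;          // only samples outside the blob cost
    }
    return sum;
}

// Test blob: two separate discs of radius 5, centered at (0,0) and (20,0).
// The value is only a rough distance, but its sign is right.
double twoDiscs(Pt p) {
    const double d1 = 5.0 - std::hypot(p.x, p.y);
    const double d2 = 5.0 - std::hypot(p.x - 20.0, p.y);
    return std::max(d1, d2);
}
```

With the two discs, a bone from (0,0) to (20,0) has both ends safely inside white areas, so a joints-only test is happy, but the middle sample lands in the gap and the bone gets a penalty of 25 with 4 samples. In the real code the distance function would be the pointPolygonTest call on the hand contour.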
The model right now doesn’t use all the possible joints, because that is too heavy computationally. Plus, the energy does not depend on (or change) the angle of the fingers. So this is a very, very simple model of a hand…
But it is a good start! All the other stuff I have seen online is just basic counting of high-curvature points and color-based or feature-based segmentation and tracking… My model actually tries to fit an articulated and precise model of a hand to the image.

How did you get such nice blobs?!

You ask. They are beautiful aren’t they… nice and clean, easy for tracking and model fitting. It’s no magic though…
Well, I took part in a project at the Media Lab called DepthJS, which uses the MS Kinect to control web pages. I wrote the computer-vision part. So all the code is there and you can grab it; I just plugged it into this little project, building on this very simple example of using OpenCV 2.X and libfreenect.
Wow, this was a longie.. I hope you learned something and got inspired. I got to do a second overview of the project, and I’m inspired. Inspiration all around!
Code is obviously yours for the taking:
Please contribute your own views, thoughts, code, rants in the comments and github page.

17 replies on "Hand gesture recognition via model fitting in energy minimization w/OpenCV"

Hi Roy,
When I compile the shared code in Visual Studio 2005, I get the following errors in the main.cpp file. Can you please let me know whether I am making a mistake while compiling? Your help will be greatly appreciated.
error C2678: binary ‘*’ : no operator found which takes a left-hand operand of type ‘int’ (or there is no acceptable conversion) line-64
error C2678: binary ‘+’ : no operator found which takes a left-hand operand of type ‘cv::Mat’ (or there is no acceptable conversion) line-248
error C2664: ‘cv::Mat::Mat(const cv::Mat &)’ : cannot convert parameter 1 from ‘cv::Point_’ to ‘const cv::Mat &’ line-404
error C2664: ‘cv::Mat::Mat(const cv::Mat &)’ : cannot convert parameter 1 from ‘cv::Point_’ to ‘const cv::Mat &’ line-416
error C2660: ‘cv::namedWindow’ : function does not take 1 arguments line-449

Follow the instructions of the compiler… look at these lines and try to work around the problems.
Perhaps your OpenCV version is not up to date, so some of the matrix operators (“*” and “+”) are not defined properly. Try updating OpenCV

Hi Roy,
Thanks a lot for your suggestion. I compiled the code with OpenCV 2.2 and it works fine.
Actually, I tried a number of sample images, and what I find is that your code works fine when the background is black. Whenever the hand is in front of a non-black background, the hand detection is not proper. Do you agree? If so, is there any solution to this problem?

Hi Roy, I have now run into these problems:
1>graphcut.obj : error LNK2001: unresolved external symbol “public: __int64 __thiscall GCoptimization::compute_energy(void)” (?compute_energy@GCoptimization@@QAE_JXZ)
1>graphcut.obj : error LNK2001: unresolved external symbol “public: void __thiscall GCoptimization::setDataCost(int,int,int)” (?setDataCost@GCoptimization@@QAEXHHH@Z)
1>graphcut.obj : error LNK2001: unresolved external symbol “public: __int64 __thiscall GCoptimization::expansion(int)” (?expansion@GCoptimization@@QAE_JH@Z)
1>graphcut.obj : error LNK2001: unresolved external symbol “public: void __thiscall GCoptimizationGridGraph::setSmoothCostVH(int *,int *,int *)” (?setSmoothCostVH@GCoptimizationGridGraph@@QAEXPAH00@Z)
1>graphcut.obj : error LNK2001: unresolved external symbol “public: virtual __thiscall GCoptimizationGridGraph::~GCoptimizationGridGraph(void)” (??1GCoptimizationGridGraph@@UAE@XZ)
1>graphcut.obj : error LNK2001: unresolved external symbol “public: __thiscall GCoptimizationGridGraph::GCoptimizationGridGraph(int,int,int)” (??0GCoptimizationGridGraph@@QAE@HHH@Z)
Can you give me some ideas?

Sorry, I posted this in the wrong place; the problem is in the GraphCut code. Thank you very much for your hard work.

Please, can you help with a tutorial, or if possible some material, on how to develop an object recognition application for Android using Java and OpenCV?

Hi Roy,
When I compile the shared code in Visual Studio 2010, I get the following errors in the main.cpp file, which I renamed to HandGesture.cpp. Can you please let me know whether I am making a mistake while compiling? Your help will be greatly appreciated.
HandGesture.obj : error LNK2019: unresolved external symbol “class cv::Scalar_<double> __cdecl refineSegments(class cv::Mat const &,class cv::Mat const &,class cv::Mat &,class std::vector<class cv::Point_<int>,class std::allocator<class cv::Point_<int> > > &,class std::vector<class cv::Point_<int>,class std::allocator<class cv::Point_<int> > > &,class cv::Point_<int> &)” (?refineSegments@@YA?AV?$Scalar_@N@cv@@ABVMat@2@0AAV32@AAV?$vector@V?$Point_@H@cv@@V?$allocator@V?$Point_@H@cv@@@std@@@std@@2AAV?$Point_@H@2@@Z) referenced in function “void __cdecl initialize_hand_data(struct data_for_tnc &,class cv::Mat const &)” (?initialize_hand_data@@YAXAAUdata_for_tnc@@ABVMat@cv@@@Z)
1>HandGesture.obj : error LNK2019: unresolved external symbol _simple_tnc referenced in function “void __cdecl estimateHand(class cv::Mat &)” (?estimateHand@@YAXAAVMat@cv@@@Z)

Looks like you cannot link some functions. That’s strange because the functions are in the code.
Check that you are indeed compiling bg_fg_blobs.cpp and tnc.c, they contain these two functions you miss.

Thank you very much Roy,
I’m almost a beginner in Opencv and doing a biometric project
really I’m exhausted I REPEATED the same code downloads and compiling many times and still having the same errors
main.obj : error LNK2019: unresolved external symbol “class cv::Scalar_<double> __cdecl refineSegments(class cv::Mat const &,class cv::Mat const &,class cv::Mat &,class std::vector<class cv::Point_<int>,class std::allocator<class cv::Point_<int> > > &,class std::vector<class cv::Point_<int>,class std::allocator<class cv::Point_<int> > > &,class cv::Point_<int> &)” (?refineSegments@@YA?AV?$Scalar_@N@cv@@ABVMat@2@0AAV32@AAV?$vector@V?$Point_@H@cv@@V?$allocator@V?$Point_@H@cv@@@std@@@std@@2AAV?$Point_@H@2@@Z) referenced in function “void __cdecl initialize_hand_data(struct data_for_tnc &,class cv::Mat const &)” (?initialize_hand_data@@YAXAAUdata_for_tnc@@ABVMat@cv@@@Z)
main.obj : error LNK2019: unresolved external symbol _simple_tnc referenced in function “void __cdecl estimateHand(class cv::Mat &)” (?estimateHand@@YAXAAVMat@cv@@@Z)
I’m using win XP with opencv2.2 and visual studio 2010
Is there any use of the cmakelist.txt file or the cmake folder in my case?
If there are any suggestions please tell me
So sorry for the disturbance and thank you very much in advance

Indeed you should be working with the CMakeLists.txt to compile the program.
Download CMake, run it, and point it at the directory where the code is. After you “Configure” and “Generate”, it should create a good Visual Studio solution that will compile the code correctly.

I don’t have a Kinect. Will this work on a regular webcam?
And thanks for sharing the code!

@Laurenzo, Roy
Yes, it works fine with a regular webcam, and it compiled well on Linux (I didn’t use CMake). You can also do a simple HSV segmentation in the capture loop in main:
where img is the captured image, hsvframe and gray are Mat.
For my hand I used:
Hue_lo = 0;Hue_hi = 10;Sat_lo = 50;Sat_hi = 170;
(will depend on your skin color, environment, white balance, etc…).
You can fine tune the thresholds above with the help of the openCV HS histogram, link provided by Roy.
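
The snippet from the comment above did not survive, so here is a self-contained sketch of the same kind of hue/saturation test on a single pixel, using the thresholds quoted there. It assumes OpenCV's 8-bit HSV convention (hue is degrees divided by 2, saturation is 0..255); `rgbToHueSat` and `looksLikeSkin` are names made up for this sketch. In an OpenCV program the usual route would be cvtColor followed by inRange on the whole frame instead.

```cpp
#include <algorithm>
#include <cmath>

// Hue on OpenCV's 8-bit scale (degrees / 2, so 0..180), saturation on 0..255.
struct HS { double h, s; };

// Convert one RGB pixel (channels in 0..255) to hue and saturation.
HS rgbToHueSat(double r, double g, double b) {
    const double mx = std::max({r, g, b});
    const double mn = std::min({r, g, b});
    const double delta = mx - mn;
    HS out{0.0, 0.0};
    if (mx > 0.0) out.s = 255.0 * delta / mx;
    if (delta > 0.0) {
        double deg;
        if (mx == r)      deg = 60.0 * (g - b) / delta;
        else if (mx == g) deg = 120.0 + 60.0 * (b - r) / delta;
        else              deg = 240.0 + 60.0 * (r - g) / delta;
        if (deg < 0.0) deg += 360.0;      // wrap negative hues around
        out.h = deg / 2.0;
    }
    return out;
}

// Thresholds quoted above: Hue 0..10, Sat 50..170.
bool looksLikeSkin(double r, double g, double b) {
    const HS p = rgbToHueSat(r, g, b);
    return p.h >= 0.0 && p.h <= 10.0 && p.s >= 50.0 && p.s <= 170.0;
}
```

For example, the pixel RGB (200, 120, 100) has hue 6 and saturation 127.5 on this scale, so it passes, while pure blue or a plain gray pixel does not. Reddish hues just below 360 degrees wrap to the top of the scale and are missed by a 0..10 range, so a second range near 180 may be worth adding.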

