Converting 2D golf swing sequences into 3D models

Wednesday, March 12, 2008

End of Downswing

We haven't been considering the end of the downswing, so our results for that portion are understandably erratic. But in a few cases, it does an okay job.






Demos

http://www.cs.ucsd.edu/~rchugh/swingimp/

Miscellaneous Video Tools

To create a video from images:

mencoder "mf://*.png" -vf scale=360:240 -mf fps=29 -o out.avi -ovc lavc -lavcopts vcodec=msmpeg4v2:vbitrate=800

To embed a movie directly in a webpage:

http://cit.ucsf.edu/embedmedia/step1.php

Monday, March 10, 2008

Oops

Extreme (part 2)

Over the weekend we rewrote the early parts of our pipeline (up through Canny, right before Hough) to streamline them and make them a little faster. In the process, we introduced a bug that offset the per-frame masks by one frame. This degraded the quality of edge detection and, consequently, of the Hough lines. Now that we've fixed the bug, the Hough lines are essentially back to where they were before, and as a result our extreme hypothesis checker shows better results.



Notice that for this video, the results are what we would like to always have: that the topmost point is at the highest point in the swing and that the bottommost point is the shadow of the club head at the bottom of the swing.

Sunday, March 9, 2008

Extreme Hypothesis Checking

In deciding whether or not to guide RANSAC towards fitting extreme points in all directions, we have examined the extreme points obtained on some of our sample videos. Of the five videos we tried this on, the four extreme points corresponded to the clubhead in most cases. In one case, the topmost point fell along the shaft rather than all the way out at the clubhead.
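A minimal sketch of how the four extreme points could be pulled from a set of clubhead hypotheses (the function name and point format here are illustrative, not from our actual code):

```python
# Sketch: pick the four extreme clubhead hypotheses from a list of
# (x, y) points. Names here are illustrative, not from our codebase.
def extreme_points(points):
    """Return (leftmost, rightmost, topmost, bottommost) points."""
    leftmost   = min(points, key=lambda p: p[0])
    rightmost  = max(points, key=lambda p: p[0])
    topmost    = min(points, key=lambda p: p[1])   # image y grows downward
    bottommost = max(points, key=lambda p: p[1])
    return leftmost, rightmost, topmost, bottommost
```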





Of course, to really validate that the extreme points are always good, we need to test on many more videos with varying conditions, which our sample videos do not have.

(face5, gw3)

Monday, March 3, 2008

Partial Downswing Fitting

We're experimenting a little with fitting the downswing as two separate parts. We're manually identifying the "crossover" frame: the point at which the clubhead passes 2*Pi (i.e., the club is directly to the right of the reference point).
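A hedged sketch of how the crossover frame might be found automatically rather than by hand, assuming per-frame clubhead angles that increase through the downswing (the angle convention and function name are assumptions):

```python
import math

# Sketch: unwrap the per-frame clubhead angle so it accumulates past
# 2*pi, then report the first frame at or past 2*pi (the "crossover").
# Assumes the angle increases through the downswing.
def find_crossover(angles):
    """angles: per-frame clubhead angles in [0, 2*pi)."""
    total = angles[0]
    unwrapped = [total]
    for prev, cur in zip(angles, angles[1:]):
        delta = cur - prev
        if delta < -math.pi:               # wrapped around past 2*pi
            delta += 2 * math.pi
        total += delta
        unwrapped.append(total)
    for i, a in enumerate(unwrapped):
        if a >= 2 * math.pi:
            return i
    return None
```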

The first part of the downswing is being fit a little better.





The second part of the downswing is too gruesome to show. The curve is completely erratic. One likely source of error is that at the end of the swing, the clubhead is closer to the reference point than the hand is, so we choose the wrong end of the line segment as the clubhead. An easy way to solve this problem is to track only up to a certain point, such as Pi/2, where the club is directly above the reference point (and is still farther from it than the hand).

Saturday, March 1, 2008

Something Like Tomographic Reconstruction

The motion history masks have a lot of good signal that we want to take advantage of to supply hints to the RANSAC fitting about where the boundaries of the swing trajectories are. We are using a projection-slice approach to try to get a gift-wrap-style fit around the border of the swing.

We first sum up the number of white pixels in each column. Then we scan from left to right to compute two things:
  • The index of the leftmost column with at least THRESHOLD% pixels white.
  • The index of the rightmost column with at least THRESHOLD% pixels white.
These indices give us two lines representing the boundary of the object projected onto the x-axis.

We rotate the image and repeat this process, giving us similar projections along multiple directions. The polygon enclosed by these boundaries is an approximation of the boundary of the object.
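One projection direction of the process above might look like the following sketch (rotation between directions is omitted; names and the threshold interpretation are illustrative):

```python
import numpy as np

# Sketch of one projection direction: sum white pixels per column, then
# find the leftmost/rightmost columns whose white count reaches
# THRESHOLD% of the column height. Rotating the mask and repeating gives
# the other projection directions.
def projection_bounds(mask, threshold_frac):
    """mask: 2D boolean array. Returns (left_col, right_col) or None."""
    col_counts = mask.sum(axis=0)            # white pixels per column
    cutoff = threshold_frac * mask.shape[0]  # THRESHOLD% of column height
    hits = np.nonzero(col_counts >= cutoff)[0]
    if hits.size == 0:
        return None
    return int(hits[0]), int(hits[-1])
```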

The following are the results of running this process on the binary upswing motion history image from a previous post. The thresholds for the images below are 3%, 6%, and 9%, and we evaluate projections in 16 directions.







The following is the result for the corresponding upswing with a threshold of 5%.



Notice that the intensity of the mask is duller than the original. This is probably an artifact from the way we are manually rotating images, but it doesn't seem to affect the results.

Also note we have not yet computed the enclosing polygon; we are simply drawing the boundary lines from each projection.

Friday, February 29, 2008

Weighted Curve Fitting

In an attempt to fit more points along the trajectory, we tried a simple weighting scheme to favor points with larger magnitudes (distance from the origin). We define a set of concentric rings around the origin with radii at multiples of 50 pixels. Then, for a point in the ring with radius n*50, we use n*THRESHOLD as the threshold for determining whether or not a point agrees with a model generated by RANSAC (where THRESHOLD is the threshold we were using for all points before).
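The ring-based threshold could be sketched like this (the base threshold value is illustrative):

```python
import math

RING_WIDTH = 50        # pixels per concentric ring
BASE_THRESHOLD = 5.0   # illustrative base agreement threshold

# Sketch of the ring-based scheme: a point in the ring of radius n*50
# uses n*THRESHOLD when testing agreement with a RANSAC model.
def inlier_threshold(x, y, origin=(0.0, 0.0)):
    r = math.hypot(x - origin[0], y - origin[1])
    ring = max(1, math.ceil(r / RING_WIDTH))  # ring index n >= 1
    return ring * BASE_THRESHOLD
```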

The results using this scheme:





The model fit for the upswing is actually worse than it was before, but the downswing curve follows the actual trajectory better for longer.

In general, this doesn't seem like a promising approach. One related possibility is to take the extreme points (the left-most, bottom-most, etc) and force RANSAC to fit at least those points. But this does not sound like a robust approach either.

Wednesday, February 27, 2008

Simple Motion History Images

The following images correspond to the first video from the previous two posts (the one in which the golfer is wearing red).

In a previous phase of the pipeline, we've isolated the moving pixels in each frame as a binary mask. To generate these images, we iterate over the binary mask for each frame and count the number of white pixels at each location (i,j).

The first row contains binary motion histories for the up- and downswing, where a white pixel means at least one frame mask had that particular pixel set. For the images in the second row, we've normalized the intensity of white relative to the maximum number of votes any one pixel had.
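The accumulation described above can be sketched as follows (illustrative names, NumPy assumed):

```python
import numpy as np

# Sketch: accumulate per-frame binary masks into a vote count per pixel,
# then derive the binary and intensity-normalized motion history images.
def motion_history(masks):
    """masks: list of 2D boolean arrays of equal shape."""
    votes = np.zeros(masks[0].shape, dtype=np.int32)
    for m in masks:
        votes += m.astype(np.int32)        # count frames with pixel set
    binary = votes > 0                     # white if any frame had it set
    normalized = votes / votes.max()       # intensity relative to max votes
    return binary, normalized
```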























The motion history images for the second video (not shown here) are similar in quality, with slightly less background noise.

Space and Time Estimation

The trajectory estimation computes magnitude as a function of angle. We have also now computed angle as a function of time (frames * seconds/frame). Following the Gehrig paper, we use a 3rd degree polar polynomial for the upswing and a 5th degree one for the downswing. The angles used to compute f(t) are those of the inliers from the best trajectory fitting.
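A minimal sketch of the angle-vs-time fit using NumPy's least-squares polynomial fitting (the fps value and function name are assumptions; our actual fitting code may differ):

```python
import numpy as np

# Sketch of fitting angle as a function of time, following the post:
# t in seconds (frame index * seconds/frame), theta in radians, with a
# 3rd-degree polynomial for the upswing (5th for the downswing).
def fit_angle_vs_time(frame_indices, angles, fps=29.0, degree=3):
    t = np.asarray(frame_indices, dtype=float) / fps
    coeffs = np.polyfit(t, angles, degree)
    return np.poly1d(coeffs)               # callable model theta = f(t)
```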

These results are a decent start, especially for the upswing.



Problems with Downswing Fitting

We are still trying to improve the results of the downswing curve fitting. A couple of fundamental things that still need work:

  • More accurate identification of transition

  • More reliably offsetting angles in the downswing by 2*Pi where appropriate


A more experimental possibility we may try stems from the following observation: none of the bad clubhead hypotheses lay outside of the actual trajectory. In fact, there are sometimes many good hypotheses on the actual trajectory that are not in the consensus set of the final model. For example:






Two things that might help the fitting process:

  • Weight the fitting towards points with larger magnitudes.

  • Weight the fitting towards points later in the swing, because these two images show that most inliers are early in the downswing.

Saturday, February 16, 2008

Upswing Curve Fitting

Some preliminary results from fitting degree-4 polar curves to upswing frames:



Wednesday, February 13, 2008

More Club Detection

We've worked on club detection so that it works on a better sample video of a full golf swing (in slow motion). We adjusted the threshold for the LO Hough transform so that it still gets good signal around the club without as much erroneous signal. We also realized that for the HI lines, we were getting a lot of good signal but pruning too much of it out. We've improved the filtering process for HI a little, but there is still room to retain even more HI signal (e.g., by working harder to merge parallel and disconnected segments).

We've also estimated the top of the backswing using a simple metric. For each frame, we consider the best single hypothesis for the club position. We then look for places where the 2nd derivative between two frames is 0, and take these frames as candidates for the transition from upswing to downswing. We then use a weighting scheme to assign each candidate a score, based on its distance from the middle of the sequence and the number of surrounding LO hypotheses, and choose the frame with the lowest score.
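A simplified sketch of the transition estimate (the LO-hypothesis term of the score is omitted here, and the tolerance is illustrative):

```python
import numpy as np

# Simplified sketch of the transition estimate: find frames where the
# second difference of the clubhead angle is (near) zero, then score the
# candidates by distance from the middle of the sequence and keep the
# lowest score. The LO-hypothesis term from the post is omitted.
def estimate_transition(angles, tol=1e-3):
    a = np.asarray(angles, dtype=float)
    d2 = np.diff(a, n=2)                   # discrete 2nd derivative
    candidates = [i + 1 for i in range(len(d2)) if abs(d2[i]) < tol]
    if not candidates:
        return None
    mid = (len(a) - 1) / 2.0
    return min(candidates, key=lambda i: abs(i - mid))
```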

The changes to Hough thresholding and detection of the transition work pretty well on the following video.



Upswing:



Downswing:



The hypotheses for the clubhead (marked by white circles) are fairly accurate, so we should be able to get a good polynomial approximation using least squares and RANSAC.

We also need to develop a library of test videos, with different conditions, so that we can be confident that the refinements we are making are robust.

Wednesday, February 6, 2008

Linear Fit RANSAC Test



There are 100 points around the line y=x, randomly perturbed a small amount. There are also 50 points scattered uniformly at random. The yellow line represents the line of best fit using least squares, and the blue line represents the line of best fit among consensuses of at least 70 points. The red points are those in consensus about the best model estimated by RANSAC.
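A sketch of the RANSAC line-fitting test described above (iteration count, tolerance, and seed are illustrative):

```python
import random

# Sketch of the RANSAC line test: sample two points, fit a line through
# them, count inliers within a tolerance, and keep the model with the
# largest consensus of at least `min_consensus` points.
def ransac_line(points, iters=500, tol=0.5, min_consensus=70, seed=0):
    rng = random.Random(seed)
    best = None
    best_inliers = []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue                       # skip degenerate vertical pair
        m = (y2 - y1) / (x2 - x1)          # slope of candidate line
        b = y1 - m * x1
        inliers = [(x, y) for x, y in points if abs(y - (m * x + b)) < tol]
        if len(inliers) >= min_consensus and len(inliers) > len(best_inliers):
            best, best_inliers = (m, b), inliers
    return best, best_inliers
```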

Wednesday, January 30, 2008

Detection Using a Kind of Hysteresis

We are experimenting with an approach that makes use of two Hough transforms (we'll call them HI and LO) -- one with higher thresholds for line segment length and proximity, and one with lower thresholds. [Our use of HI and LO probably isn't the best choice. HI corresponds to a strict threshold that potentially throws away good signal. LO corresponds to a looser threshold that may let in more bad signal.]

Aside: OpenCV's cvHoughLines2 returns a traditional rho-theta representation if passed the CV_HOUGH_STANDARD flag, and finite line segments (with x-y coordinates) if passed the CV_HOUGH_PROBABILISTIC flag.

At a high level, our approach keeps line segments from HI when they exist, and when they don't, it keeps line segments from LO that are "near" the last detected club segment.

We make several simple implementation choices, but we still achieve decent results (demonstrated below):
  • We only keep the results from HI when there are exactly 2 segments. For the parameters we have chosen for HI, and for the test video we are using, there are often exactly 2 segments corresponding to the two edges of the club shaft. We could relax this by also allowing 1 (or 3) good segment(s) to serve as our club hypothesis, or two pairs of collinear but disconnected segments (e.g., split by an occlusion).
  • The first time a HI segment is recognized, we assume that it is a good starting position for the club. We compare the segment's two endpoints and label the higher one as the original hand position. We then use this as an anchor point to determine, for all subsequent segments, which end is the hand and which end is the club head. Of course, if the first HI segment is not a good match for the club shaft, the entire tracking algorithm will suffer.
  • We keep track of the last club segment detected by HI for use when a subsequent frame has no HI signal. When a frame has only LO signal, all of its individual segments are compared to the last HI segment. We choose the segment that minimizes a simple error metric, defined to be the difference in slope and in positions of hand and head points. A more sophisticated approach would be to incorporate a notion of the motion model of the club. This, however, seems like overkill, because the motion model is the thing we want to *learn* in a later phase.
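The error metric from the last bullet might be sketched as follows (the slope weight is an assumption; our actual metric may balance the terms differently):

```python
import math

# Sketch of the LO-segment selection: compare each LO segment (hand and
# clubhead endpoints) to the last HI segment and keep the one minimizing
# a simple error combining slope difference and endpoint displacement.
# The slope weight is illustrative.
def segment_error(seg, last, slope_weight=100.0):
    (hx, hy), (cx, cy) = seg               # hand, clubhead endpoints
    (lhx, lhy), (lcx, lcy) = last

    def slope(p, q):
        return (q[1] - p[1]) / (q[0] - p[0] + 1e-9)  # guard vertical

    d_slope = abs(slope(*seg) - slope(*last))
    d_hand = math.hypot(hx - lhx, hy - lhy)
    d_head = math.hypot(cx - lcx, cy - lcy)
    return slope_weight * d_slope + d_hand + d_head

def closest_lo_segment(lo_segments, last_hi):
    return min(lo_segments, key=lambda s: segment_error(s, last_hi))
```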
Here are results from the approach described above, with green lines representing segments detected by HI, red lines representing the closest LO segments to the last HI segment detected, yellow dots representing our notion of hand location, and white dots representing our notion of club head location.


[embedded YouTube video]

This does okay, except in long stretches of frames where no HI signal is observed. A simple way to address this problem is to keep track of the last segment detected, whether it be from HI or LO. This approach is demonstrated below, and does much better with long stretches of LO frames.



[embedded YouTube video]

Notice that in the vast majority of frames, the line segment we output is at least collinear with the club shaft, even though its length is often incorrect. We should be able to use color information to better hypothesize the length of the shaft.

Although our simple "tracking" approach works well on this test case, one possibility to improve accuracy is to maintain a window of the last k frames instead of just 1.

Friday, January 25, 2008

Working on Club Extraction, Part 2

To extract more features of the golf club, we ran the Hough transform on a lower threshold to produce these results:



As shown, the club is now detected in areas where it previously wasn't. To get rid of unneeded lines in the frames, we iterated through each line from the lower-threshold pass and kept only those close to lines from the threshold we used previously. However, since our previous threshold left some frames with zero detected lines, for those frames we kept all of the lines the lower threshold produced.

With enough tweaking and threshold adjusting, we may be able to get even better results.

Wednesday, January 23, 2008

Working on Club Extraction

We have continued work on extracting the club in each frame, following the process used by Gehrig et al.
  • After computing the motion mask from before, we perform a morphological closing (2 iterations of dilation and erosion) to remove small gaps and help smooth edges.
  • We apply this mask to the original frame to isolate the moving parts in the original image.
  • After converting to grayscale, we run Canny edge detection.
  • Finally we run a Hough transform to detect line segments.
We've run these steps on our previous sample video with the following results:



As you can see, the line segment detection works pretty well when the club is moving slowly, but not when the club is moving more quickly and against a background with similar color.

We also ran this process on down-the-line and up-the-line videos, without changing any of the parameters to the edge and line detection algorithms:




We will continue to experiment with the parameters of the various stages (closing, Canny, Hough) to try to get better results for these sample videos. We will also record video with a more neutral background and higher contrast to the golf club to see how this segment detection performs.

Given more accurate results from Hough, the next step is to examine the segments discovered and merge close-to-parallel line segments. Once these parallel line segments are defined, we will trace them in the original images to try to detect the position of the clubhead and the golfer's hand by looking at color changes.

Wednesday, January 16, 2008

Motion Detection

Club tracking from monocular video has been successfully implemented and incorporated into commercial software, and this work is described here. In their slide deck, they describe the phases of their algorithm. We implemented the first phase -- localizing the moving objects in a video -- using the same straightforward techniques they do.

They first compute the pixelwise difference of each consecutive pair of frames. Pixelwise ANDing each consecutive pair of these "diffed" images then yields a mask of just the objects in motion for each given frame. Put another way, for three consecutive frames A, B, and C, the motion detection mask for frame B is diff(A, B) AND diff(B, C), where diff denotes the thresholded pixelwise absolute difference.
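A NumPy sketch of this three-frame mask (the threshold value is illustrative):

```python
import numpy as np

# Sketch of the three-frame motion mask: threshold the absolute
# difference of each consecutive frame pair, then AND the two results.
def motion_mask(a, b, c, thresh=15):
    """a, b, c: consecutive grayscale frames as 2D uint8 arrays."""
    d1 = np.abs(a.astype(np.int16) - b.astype(np.int16)) > thresh
    d2 = np.abs(b.astype(np.int16) - c.astype(np.int16)) > thresh
    return d1 & d2     # set only where B differs from both neighbors
```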

The results of this algorithm are demonstrated on the following video (fixed camera position, 29fps, physical motion in slow motion).



Monday, January 14, 2008

Monocular vs. Binocular

Existing work has been done on recovering 3D positions from monocular images, with the camera facing the golfer during the swing. We are wondering whether techniques operating on a monocular view would be able to accurately distinguish the following positions. Notice that from the face-on view, the body positions are very similar, and the differences in wrist angles and in the apparent length of the club (due to foreshortening) are not drastic. Looking at the same three positions from the down-the-line angle shows that the club is actually in drastically different positions, and these differences are critical when examining and analyzing a golf swing. We will need to investigate how existing monocular-view algorithms perform on these sorts of test cases.






















(Note: the poses were re-enacted separately for the face-on and down-the-line shots)

Different Frame Rates

We took video of my golf swing with a point-and-shoot digital camera that records at 30fps (left). I'm swinging a short club that's probably traveling around 75mph at the bottom of my swing. Notice that the frame rate is too low to accurately track the trajectory of the club. Typical swings with longer clubs reach the 100 mph range, so the motion blur will become even more pronounced.

We will use high frame rate digital cameras to record our videos, but before that we would like to get an idea of the type of results we will get. We artificially simulated the results of a higher frame rate camera by slowing down my swing. My normal swing with this club took about 2 seconds, so I made two more swings at about 4 and 8 seconds. Thus, we get an idea of what images from 60fps (center) and 120fps (right) cameras will look like.









Wednesday, January 9, 2008

Project Overview

Our goal is to build a system that allows a golfer to create a customized 3D model of his/her golf swing from 2D video of the swing itself. A motion capture session is the most accurate way to obtain a 3D model of this motion, but few golfers have access to such expensive and uncommon mocap studios. To help a golfer realize some of the benefits of viewing his/her own swing as a 3D model, our system will import the swing video (hence the name) and try to reconstruct the 3D motion from the 2D images.

Because we are limiting the scope of this project to the motion of a golf swing, we will exploit the fact that all golf swings exhibit strong similarities. After we define the notion of a generic model, we will explore algorithms that identify key postures in the given 2D images and track their movement over time. Our intent is that restricting ourselves to the golf swing motion will avoid some of the difficulties of detecting arbitrary human body movement.

Although our long-term goal is to eventually extend the system so users can produce their golf swing videos with their own point-and-shoot cameras, our initial approach will use a more controlled environment for video capture (in order to accomplish as much as we can). Because the speed of a golf swing cannot be captured well with a typical point-and-shoot, we are planning to use cameras with frame rates on the order of 100 to 200 fps, and we will record in a room with a neutral, solid background color and neutral lighting.

We have not yet decided between processing monocular videos and videos captured from multiple angles of the same swing. Current work in vision-based motion capture covers both cases, and we need to investigate it further before deciding. If we choose to use multiple vantage points, we will then need to decide whether to set up multiple cameras that record synchronously or to use alignment techniques to manually synchronize the cameras' videos.

Once the user has uploaded his/her video and the tracking algorithms have been run to identify the positions and angles of the golfer's skeleton, we will provide a GUI that allows the user to manually correct the skeleton. This marker phase will allow the user to overcome any shortcomings the tracking algorithms exhibit, and still allow a useful 3D model to be generated.

The resulting skeleton from the tracking/user-calibration phase will then be mapped onto a base 3D golf swing model, which we will obtain either through our own mocap recording sessions or as sample data from software companies that specialize in mocap systems. One such example can be found at http://www.tmplabs.com/.

Finally, we will design a metric for evaluating the performance of our system. One possibility is to compare the difference in specific joint angles from the 2D images with those from the 3D model over time. Another possibility is to compare the result of a 3D model produced by SwingImp with a model obtained of the same swing through a mocap session. This would present the complication of having to record the 2D videos and the mocap data simultaneously, however.

As you can see, we have a busy quarter ahead of us!