Converting 2D golf swing sequences into 3D models

Wednesday, January 30, 2008

Detection Using a Kind of Hysteresis

We are experimenting with an approach that uses two Hough transforms (we'll call them HI and LO) -- one with higher thresholds for line segment length and proximity, and one with lower thresholds. [HI and LO probably aren't the best choice of names: HI corresponds to a strict threshold that potentially throws away good signal, while LO corresponds to a looser threshold that may let in more bad signal.]

Aside: OpenCV's cvHoughLines2 returns the traditional rho-theta representation if passed the CV_HOUGH_STANDARD flag, and finite line segments (with x-y endpoint coordinates) if passed the CV_HOUGH_PROBABILISTIC flag.
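
To make the two calling conventions concrete, here is a minimal sketch in the OpenCV 1.x C API. The threshold, minimum-length, and maximum-gap values are illustrative placeholders rather than the parameters we actually use, and edges is assumed to be a single-channel binary edge image (e.g. Canny output).

    #include <opencv/cv.h>

    void demo_hough_modes(IplImage* edges) {
        CvMemStorage* storage = cvCreateMemStorage(0);

        /* CV_HOUGH_STANDARD: each element is a (rho, theta) pair of floats. */
        CvSeq* lines = cvHoughLines2(edges, storage, CV_HOUGH_STANDARD,
                                     1, CV_PI / 180, 100, 0, 0);
        for (int i = 0; i < lines->total; i++) {
            float* rt = (float*)cvGetSeqElem(lines, i);
            /* rt[0] is rho, rt[1] is theta */
        }

        /* CV_HOUGH_PROBABILISTIC: each element is a finite segment,
           stored as two CvPoints. */
        CvSeq* segs = cvHoughLines2(edges, storage, CV_HOUGH_PROBABILISTIC,
                                    1, CV_PI / 180, 50,
                                    30 /* min segment length */,
                                    10 /* max gap to join */);
        for (int i = 0; i < segs->total; i++) {
            CvPoint* p = (CvPoint*)cvGetSeqElem(segs, i);
            /* p[0] and p[1] are the endpoints of segment i */
        }

        cvReleaseMemStorage(&storage);
    }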

At a high level, our approach keeps line segments from HI when they exist, and when they don't, it keeps line segments from LO that are "near" the last detected club segment.

We make several simple implementation choices, but we still achieve decent results (demonstrated below):
  • We only keep the results from HI when there are exactly 2 segments. For the parameters we have chosen for HI and for the test video we are using, there are often exactly 2 segments corresponding to the two edges of the club shaft. We could relax this by also allowing 1 (or 3) good segment(s) to serve as our club hypothesis, or two pairs of segments where each pair is collinear but disconnected (e.g. because of an occlusion).
  • The first time a HI segment is recognized, we assume that this is a good starting position for the club. We compare the two endpoints of the segment and label the higher one as the original hand position. We then use this as an anchor point to determine, in all subsequent segments, which end is the hand and which is the club head. Of course, if the first HI segment is not a good match for the club shaft, the entire tracking algorithm will suffer.
  • We keep track of the last club segment detected by HI for use when a subsequent frame has no HI signal. When a frame has only LO signal, all of its individual segments are compared to the last HI segment. We choose the segment that minimizes a simple error metric, defined to be the difference in slope and in the positions of the hand and head points (a sketch appears after this list). A more sophisticated approach would be to incorporate a notion of the motion model of the club. This, however, seems like overkill, because the motion model is the thing we want to *learn* in a later phase.
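
Here is a minimal sketch in C of the error metric and the selection step. The names, the weighting constant ALPHA, and the use of atan2 orientation in place of raw slope (to avoid blowing up on near-vertical shafts) are all illustrative choices, not tuned or final ones.

    #include <opencv/cv.h>
    #include <math.h>

    /* A club segment whose endpoints have already been labeled. */
    typedef struct { CvPoint hand; CvPoint head; } ClubSegment;

    static double point_dist(CvPoint a, CvPoint b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return sqrt(dx * dx + dy * dy);
    }

    /* Oriented angle of the hand->head direction, in radians. */
    static double segment_angle(ClubSegment s) {
        return atan2((double)(s.head.y - s.hand.y),
                     (double)(s.head.x - s.hand.x));
    }

    /* Error between a candidate segment and the last detected club segment:
       orientation difference plus hand/head position differences.
       ALPHA trades off the two terms and is an illustrative value. */
    #define ALPHA 50.0
    static double segment_error(ClubSegment cand, ClubSegment last) {
        double dtheta = fabs(segment_angle(cand) - segment_angle(last));
        return ALPHA * dtheta
             + point_dist(cand.hand, last.hand)
             + point_dist(cand.head, last.head);
    }

    /* Choose the LO segment closest to the last detected club segment. */
    ClubSegment best_lo_segment(ClubSegment* lo, int n, ClubSegment last) {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (segment_error(lo[i], last) < segment_error(lo[best], last))
                best = i;
        return lo[best];
    }

In practice each raw LO segment would be scored in both endpoint orderings, since the Hough output does not tell us which end is the hand.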
Here are results from the approach described above, with green lines representing segments detected by HI, red lines representing the closest LO segments to the last HI segment detected, yellow dots representing our notion of hand location, and white dots representing our notion of club head location.


[Video: tracking results for the approach described above]

This does okay, except in long stretches of frames where no HI signal is observed. A simple way to address this problem is to keep track of the last segment detected, whether it be from HI or LO. This approach is demonstrated below, and does much better with long stretches of LO frames.



[Video: tracking results when the last detected segment is remembered, whether from HI or LO]

Notice that in the vast majority of frames, the line segment we output is at least collinear with the club shaft, even though its length is often incorrect. We should be able to use color information to better hypothesize the length of the shaft.

Although our simple "tracking" approach works well on this test case, one possibility to improve accuracy is to maintain a window of the last k frames instead of just 1.

Friday, January 25, 2008

Working on Club Extraction, Part 2

To extract more features of the golf club, we ran the Hough transform with a lower threshold to produce these results:



As shown, the club is now detected in areas where it previously wasn't. In an attempt to get rid of the unneeded lines this introduces, we iterated through each line from the lower-threshold pass and kept only the lines that were close to lines from the higher-threshold pass we used previously. However, since the previous threshold left some frames with zero lines detected, we kept all of the lines the lower threshold produced for those frames. A rough sketch of this filter follows.
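
In this sketch, the midpoint-distance notion of "close" and the pixel threshold are crude illustrative stand-ins for whatever closeness test we settle on.

    #include <opencv/cv.h>
    #include <math.h>

    #define NEAR_PIXELS 20.0  /* illustrative threshold */

    static CvPoint midpoint(CvPoint a, CvPoint b) {
        CvPoint m = { (a.x + b.x) / 2, (a.y + b.y) / 2 };
        return m;
    }

    /* Keep a lower-threshold line only if it lies near some
       higher-threshold line from the same frame; if the frame has no
       higher-threshold lines at all, keep everything. */
    int keep_lo_line(CvPoint lo[2], CvPoint (*hi)[2], int n_hi) {
        if (n_hi == 0) return 1;
        CvPoint m = midpoint(lo[0], lo[1]);
        for (int i = 0; i < n_hi; i++) {
            CvPoint h = midpoint(hi[i][0], hi[i][1]);
            double dx = m.x - h.x, dy = m.y - h.y;
            if (sqrt(dx * dx + dy * dy) < NEAR_PIXELS) return 1;
        }
        return 0;
    }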

With enough tweaking and threshold adjusting, we may be able to get even better results.

Wednesday, January 23, 2008

Working on Club Extraction

We have continued work on extracting the club in each frame, following the process used by Gehrig et al.
  • After computing the motion mask from before, we perform a morphological closing (2 iterations of dilation and erosion) to remove small gaps and help smooth edges.
  • We apply this mask to the original frame to isolate the moving parts in the original image.
  • After converting to grayscale, we run Canny edge detection.
  • Finally, we run a Hough transform to detect line segments. (A sketch of the full pipeline appears after this list.)
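
Here is a minimal sketch of these four steps in the OpenCV 1.x C API. All numeric parameters are illustrative starting points rather than our actual settings, and the motion mask is assumed to have been computed as described in the Motion Detection post below.

    #include <opencv/cv.h>

    /* frame: original BGR frame; mask: 8-bit single-channel motion mask. */
    CvSeq* extract_club_segments(IplImage* frame, IplImage* mask,
                                 CvMemStorage* storage) {
        CvSize size = cvGetSize(frame);

        /* 1. Morphological closing (2 iterations) to remove small gaps. */
        IplImage* closed = cvCreateImage(size, IPL_DEPTH_8U, 1);
        IplImage* temp   = cvCreateImage(size, IPL_DEPTH_8U, 1);
        cvMorphologyEx(mask, closed, temp, NULL, CV_MOP_CLOSE, 2);

        /* 2. Apply the mask to isolate the moving parts of the frame. */
        IplImage* moving = cvCreateImage(size, IPL_DEPTH_8U, 3);
        cvZero(moving);
        cvCopy(frame, moving, closed);

        /* 3. Convert to grayscale, then run Canny edge detection. */
        IplImage* gray  = cvCreateImage(size, IPL_DEPTH_8U, 1);
        IplImage* edges = cvCreateImage(size, IPL_DEPTH_8U, 1);
        cvCvtColor(moving, gray, CV_BGR2GRAY);
        cvCanny(gray, edges, 50, 150, 3);

        /* 4. Probabilistic Hough transform for finite line segments. */
        CvSeq* segments = cvHoughLines2(edges, storage,
                                        CV_HOUGH_PROBABILISTIC,
                                        1, CV_PI / 180, 40, 30, 10);

        cvReleaseImage(&closed); cvReleaseImage(&temp);
        cvReleaseImage(&moving); cvReleaseImage(&gray);
        cvReleaseImage(&edges);
        return segments;
    }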
We've run these steps on our previous sample video with the following results:



As you can see, the line segment detection works pretty well when the club is moving slowly, but not when the club is moving more quickly and against a background with similar color.

We also ran this process on down-the-line and up-the-line videos, without changing any of the parameters to the edge and line detection algorithms:




We will continue to experiment with the parameters of the various stages (closing, Canny, Hough) to try to get better results for these sample videos. We will also record video with a more neutral background and higher contrast to the golf club to see how this segment detection performs.

Given more accurate results from Hough, the next step is to examine the segments discovered and merge close-to-parallel line segments. Once these parallel line segments are defined, we will trace them in the original images to try to detect the position of the clubhead and the golfer's hand by looking at color changes.

Wednesday, January 16, 2008

Motion Detection

Club tracking from monocular video has been successfully implemented and incorporated into commercial software, and that work is described here. In their slide deck, the authors describe the phases of their algorithm. We implemented the first phase -- localizing the moving objects in a video -- using the same straightforward techniques they do.

They first compute the pixelwise difference of each consecutive pair of frames. Pixelwise ANDing each consecutive pair of these "diffed" images then yields a mask of just the objects in motion for each given frame. Put another way, for three consecutive frames A, B, and C, the motion detection mask for frame B is diff(A, B) AND diff(B, C), where diff denotes the (thresholded) absolute pixelwise difference.
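
A minimal sketch of this double-difference mask in the OpenCV C API; the frames are assumed to be single-channel grayscale, and the binarization threshold is an illustrative value.

    #include <opencv/cv.h>

    /* Motion mask for the middle frame B of three consecutive
       grayscale frames A, B, C. Caller releases the returned image. */
    IplImage* motion_mask(IplImage* A, IplImage* B, IplImage* C) {
        CvSize size = cvGetSize(B);
        IplImage* d1   = cvCreateImage(size, IPL_DEPTH_8U, 1);
        IplImage* d2   = cvCreateImage(size, IPL_DEPTH_8U, 1);
        IplImage* mask = cvCreateImage(size, IPL_DEPTH_8U, 1);

        cvAbsDiff(A, B, d1);   /* |A - B| */
        cvAbsDiff(B, C, d2);   /* |B - C| */
        cvThreshold(d1, d1, 15, 255, CV_THRESH_BINARY);
        cvThreshold(d2, d2, 15, 255, CV_THRESH_BINARY);
        cvAnd(d1, d2, mask, NULL);  /* moving in both diffs */

        cvReleaseImage(&d1);
        cvReleaseImage(&d2);
        return mask;
    }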

The results of this algorithm are demonstrated on the following video (fixed camera position, 29 fps, with the swing performed in slow motion).



Monday, January 14, 2008

Monocular vs. Binocular

Existing work has been done on recovering 3D positions from monocular images, with the camera facing the golfer during the swing. We are wondering whether techniques operating on a monocular view would be able to accurately distinguish the following positions. Notice that from the face-on view, the body positions are very similar, and the differences in wrist angles and in the apparent length of the club (due to foreshortening) are not drastic. Looking at the same three positions from the down-the-line angle shows that the club is actually in drastically different positions, and these differences are critical when examining and analyzing a golf swing. We will need to investigate how existing monocular-view algorithms perform on these sorts of test cases.

[Images: three similar-looking swing positions from the face-on view, followed by the same three positions from the down-the-line view]

(Note: the poses were re-enacted separately for the face-on and down-the-line shots)

Different Frame Rates

We took video of my golf swing with a point-and-shoot digital camera that records at 30 fps (left). I'm swinging a short club that's probably traveling around 75 mph at the bottom of my swing. Notice that the frame rate is too low to accurately track the trajectory of the club. Typical swings with longer clubs are in the 100 mph range, so the motion blur will become even more pronounced.

We will use high-frame-rate digital cameras to record our videos, but before that we would like to get an idea of the type of results we will get. We artificially simulated a higher-frame-rate camera by slowing down my swing: swinging 2x (or 4x) slower while recording at 30 fps produces the same motion per frame as a 60 fps (or 120 fps) camera. My normal swing with this club took about 2 seconds, so I made two more swings at about 4 and 8 seconds. Thus, we get an idea of what images from 60 fps (center) and 120 fps (right) cameras will look like.

[Images: frames from the normal-speed swing at 30 fps (left) and the slowed swings simulating 60 fps (center) and 120 fps (right)]

Wednesday, January 9, 2008

Project Overview

Our goal is to build a system that allows a golfer to create a customized 3D model of his/her golf swing from 2D video of the swing itself. A motion capture session is the most accurate way to obtain a 3D model of this motion, but few golfers have access to such expensive and uncommon mocap studios. To help a golfer realize some of the benefits of viewing his/her own swing as a 3D model, our system will import the swing video (hence the name) and try to reconstruct from the 2D images what the 3D motion looks like.

Because we are limiting the scope of this project to the motion of a golf swing, we will exploit the fact that all golf swings exhibit broad similarities. After we define the notion of a generic model, we will explore algorithms that identify key postures in the given 2D images and track these movements over time. Our intent is that restricting ourselves to the golf swing motion will avoid some of the difficulties of detecting arbitrary human body movement.

Although our long-term goal is to eventually extend the system so that users can produce their golf swing videos with their own point-and-shoot cameras, our initial approach will use a more controlled capture environment (in order to accomplish as much as we can). Because the speed of a golf swing cannot be captured well by a typical point-and-shoot camera, we are planning to use cameras with frame rates on the order of 100 to 200 fps, and we will record in a room with a neutral, solid background color and neutral lighting.

We have not yet decided whether to process monocular videos or videos of the same swing from multiple angles. Current work in vision-based motion capture has been done for both cases, and we need to investigate this work further before deciding. In the event that we choose to use multiple vantage points, we will then need to decide whether to set up multiple cameras that record synchronously or to use alignment techniques to manually synchronize the cameras' videos.

Once the user has uploaded his/her video and the tracking algorithms have been run to identify the positions and angles of the golfer's skeleton, we will provide a GUI that allows the user to manually correct the skeleton. This marker phase will let the user overcome any shortcomings the tracking algorithms exhibit and still allow a useful 3D model to be generated.

The resulting skeleton from the tracking/user-calibration phase will then be mapped onto a base 3D golf swing model, which we will obtain either through our own mocap recording sessions or by acquiring sample data from software companies that specialize in mocap systems. One such example can be found at http://www.tmplabs.com/.

Finally, we will design a metric for evaluating the performance of our system. One possibility is to compare specific joint angles from the 2D images with those from the 3D model over time. Another possibility is to compare a 3D model produced by SwingImp with a model of the same swing obtained through a mocap session; this, however, would require recording the 2D videos and the mocap data simultaneously.

As you can see, we have a busy quarter ahead of us!