Augmented Dressed Body System Controlled by Motion Capture Data
Khaled F. Hussain1, Adel A. Sewisy1 and Islam T. El-Gendy1
1 Computer Science Department, Assiut University, Egypt
Abstract. Augmenting deformable surfaces such as cloth and the human body in real video is a challenging task. This paper presents a system for cloth and body augmentation in single-view video. The system allows users to change their clothing by changing its color, its texture, or the entire garment: it augments the user with virtual clothes, so users can replace what they are wearing with any other garment they want. As a prerequisite, the user wears a special suit and performs inside our motion capture system, which records the user's movements. From the captured data, an animated 3D character model is created, which serves as the new body. The model is rendered with the new cloth but without the head. We extract the real face of the user and place it on the virtual model. This system can be used in film production and advertisement.
Keywords: Camera registration, Cloth simulation, Color transfer, Matting, Motion capture system, Segmentation, Video editing.
Introduction
Combining computer-generated content or virtual objects with real video is of great interest in many applications such as movies and augmented reality [1]. AR (Augmented Reality) is the integration of computer-generated objects with live video in real time, in a way that the viewer cannot tell the difference between the real and the augmented world. Film production takes place all over the world in a huge range of economic, social, and political contexts, and uses a variety of technologies and cinematic techniques. An important issue in any film production or advertisement is the costumes worn by the actors and actresses and the design of their clothing. Producing a variety of costumes is essential for making a good film, but the budget required is usually very high, and it is sometimes impossible to change clothes while shooting a scene without compromising its flow.
We developed an off-line system that solves this problem: users need to wear only a single suit during the entire production. The user performs any kind of movement (e.g., dance, fashion walk, etc.) within the capture volume of our motion capture system, which consists of 24 cameras and records the user's movements. An animated 3D character model is created that resembles the user and wears the new cloth but has no head. An outside camera records the user at the same time; the user's head is extracted from this video and placed on the animated 3D character model with the new clothes. The combined real head and virtual body can be placed in any background. Our system can also work with smaller 8-camera setups, which are cheaper and therefore more appropriate for stores.
The contribution of this work is a complete system that combines real video with a virtual character animated from motion capture data. The ability to put on clothes of varying style and color, and even to change the look of the users, for example making them slim or athletic, makes the system useful for applications such as virtual try-on and digital film production.
Previous Work
Cloth tracking is a difficult and challenging problem because cloth is flexible and can self-occlude. In order to perform augmentations on a non-rigid object such as a flexible piece of cloth, a mesh representation of the cloth is obtained [2]. To achieve correct folds and wrinkles in virtual cloth, complex physical simulations are used.
A system that gives users the ability to interactively control a 3D model of themselves at home using a commodity depth camera is presented in [3]. It augments the model with virtual clothes that can be downloaded. In this system, the user passes through a multi-camera setup that captures him or her in a fraction of a second. A 3D model is created from the captured data and then transmitted to the user's home system to serve as a realistic avatar for the virtual try-on application. The drawback is that it can only be used for virtual try-on, and it does not look realistic because the face is virtual.
In [4], real-time 2D augmentations on non-rigid objects, such as clothing, are presented, in which a technique is included to establish common illumination and render augmentations with correct real-world shadows.
Another way to augment cloth involves sparse cloth-tracking in video images using a vision-based marker system with temporal coherence [5], with an image-based method to automatically acquire real-world illumination and shadows from the input frame.
A method of 3D structure reconstruction and blending using a single-camera video stream is used to insert or modify models in a real video stream [6]. The approach is based on a simplification of camera parameters and the use of projective geometry without camera calibration.
A method for augmenting deformable surfaces, like cloth, in single-view video with realistic geometric deformation and photometric properties, without a 3D reconstruction of the surface, is presented in [7]. For retexturing, retrieving the deformation and motion in the image plane is sufficient, because the augmented surface is rendered from the same point of view as the original one.
An extended optical flow formulation is used in [1, 8], together with mesh-based models and a specific color model, to estimate deformation and photometric parameters simultaneously; it accounts not only for changes in light intensity but also in light color. These methods work only if the cloth has a suitable texture, and they only place a texture on the cloth rather than augmenting the whole garment. External occlusions can be handled with an occlusion map that classifies whether a pixel is visible or not, based on local texture patch color distributions and a global occlusion color distribution.
System Overview
Our approach consists of two major parts (see Figure 1). The first is the virtual part, in which the 3D character model is generated with the free MakeHuman (http://www.makehuman.org) software. The animation of the model is taken from the user's movements in a 24-camera system, which records the movements of the reflective markers on the suit. The animated 3D character model can be rendered from any viewpoint; here, the virtual model is rendered from the same viewpoint as the outside camera, with the new virtual cloth and without the head. Section 4 describes the construction and registration part in detail.
The second part is the outside camera and its output, in which we segment the user's body, extract the real head of the user, and finally place that head on the virtual body. Section 5 describes the segmentation part in detail, while Section 6 describes the final head placement step.
Avatar Construction and Registration
This section describes the construction of the 3D character model and the steps of scene registration.
The Animated 3D Character Model
For rendering a video of the user and augmenting him or her with different clothes, or deforming his or her body, a virtual avatar is needed. Furthermore, the user's movements need to be captured so that the avatar has the same movements as the user. We use the NaturalPoint (http://www.naturalpoint.com) motion capture system to capture the user's movements. It consists of a 6 × 6 × 3 m capture volume with 12 cameras mounted at the top and 12 cameras mounted below them, as shown in Figure 2a. All cameras point towards the centre of the capture volume, where the user can move freely. Every six cameras are connected to a USB hub, each hub is connected to the next one with a synchronization cable, and the four hubs are connected to a single PC via USB. The cameras capture images with a resolution of 640 × 480 pixels. All cameras are calibrated using the ARENA (http://www.naturalpoint.com/optitrack/products/arena) software.
The freely available MakeHuman application allows users to create and modify human avatars by specifying a wide range of parameters, such as facial features, gender, and body part proportions. The 3ds Max (http://usa.autodesk.com/3ds-max) software can import the MakeHuman DAE file format. In order to create the animated 3D character model, the user's movements are captured and the 2D marker observations are reconstructed into 3D trajectories. The system exports the animation in the BVH file format, which 3ds Max can easily read onto a rigged body.
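For reference, the exported BVH file is plain text: a HIERARCHY block defining the skeleton and a MOTION block holding per-frame channel values. The following minimal sketch (not part of the actual pipeline, which imports the BVH directly into 3ds Max) reads the frame count, frame time, and channel values; the file name is a placeholder.

# Minimal sketch: inspect the MOTION block of a BVH file exported by the
# capture software. The file name "capture.bvh" is a placeholder.
def read_bvh_motion_info(path):
    """Return (frame_count, frame_time, frames) from a BVH file."""
    with open(path, "r") as f:
        lines = [line.strip() for line in f if line.strip()]

    # The MOTION section follows the HIERARCHY (skeleton) section.
    start = lines.index("MOTION")
    frame_count = int(lines[start + 1].split(":")[1])     # "Frames: N"
    frame_time = float(lines[start + 2].split(":")[1])    # "Frame Time: t"

    # Each remaining line holds one frame of channel values
    # (root translation followed by joint rotations, in degrees).
    frames = [[float(v) for v in line.split()] for line in lines[start + 3:]]
    return frame_count, frame_time, frames

if __name__ == "__main__":
    count, dt, frames = read_bvh_motion_info("capture.bvh")
    print(f"{count} frames at {1.0 / dt:.1f} fps, {len(frames[0])} channels per frame")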
Figure 2 shows a frame from the outside camera and the virtual scene in which the 3D character model is created with the same movements as the user. The outside camera is a single fixed camera that can be placed anywhere in the scene; if the user wants to change the camera's viewpoint, the calibration process must be repeated. The outside camera records at the same time and in the same place as the tracking system.
We used the 3ds Max Cloth modifier to perform the cloth simulation. We design the new virtual cloth that will replace the real one, as shown in Figure 2b. The cloth itself can be designed as pieces of garment that are sewn together to fit the avatar body and then simulated for each frame.
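The cloth simulation itself is delegated to the 3ds Max Cloth modifier. Purely as an illustration of what a per-frame, physically based cloth update involves, the sketch below advances a simple mass-spring grid by one time step under gravity; all constants and the pinning choice are assumptions made for the illustration, not part of the actual pipeline.

# Illustrative mass-spring cloth step (NOT the 3ds Max Cloth modifier).
# A rectangular grid of particles is connected by structural springs to its
# grid neighbours; one semi-implicit Euler step applies gravity and spring forces.
import numpy as np

def cloth_step(pos, vel, rest_len, dt=1/60, k=500.0, mass=0.01, damping=0.99):
    """pos, vel: (H, W, 3) arrays of particle positions and velocities."""
    force = np.zeros_like(pos)
    force[..., 1] -= 9.81 * mass                     # gravity (y axis is up)

    # Structural springs between vertical (axis 0) and horizontal (axis 1) neighbours.
    for axis in (0, 1):
        d = np.diff(pos, axis=axis)                  # edge vectors between neighbours
        length = np.linalg.norm(d, axis=-1, keepdims=True)
        f = k * (length - rest_len) * d / np.maximum(length, 1e-8)
        if axis == 0:
            force[:-1] += f; force[1:] -= f
        else:
            force[:, :-1] += f; force[:, 1:] -= f

    vel = damping * (vel + dt * force / mass)
    vel[0, :] = 0.0                                  # pin the top row (e.g., where the garment is attached)
    return pos + dt * vel, vel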
Registration
The purpose is to make the world coordinate system of the outside camera the same as the world coordinate system of the 24-camera system. The virtual scene is modeled with some objects similar to the real scene. We already have the 3D character model with the same movements, but we must also create a virtual camera that has the same point of view as the outside camera; this is the registration process. The purpose of registration in our system is to align the virtual body with the real body; without accurate registration, the output will be visually inconsistent.
In order to make a correct registration with the virtual scene, we prepare the lab with a calibration square on the floor and two boxes at the back of the lab, so we can reference the 3 markers on the calibration square and the corners of the boxes. The virtual scene contains the positions of the 3 markers (3 small spheres) and virtual boxes of the same dimensions and at the same distance from the model as in the real scene. We use a registration method [9] that aligns objects given a number of points (r_i, c_i) in the image coordinate system whose locations are (x_i, y_i, z_i) in the world coordinate system. The camera's intrinsic parameters, namely the principal point (r_0, c_0), the focal lengths (f_u, f_v), and the image distortion coefficients (radial and tangential), have been determined beforehand and are fixed, which simplifies the system of equations to be solved. Given the points in the image coordinate system and the corresponding points in the world coordinate system, the transformation (R, T) between the 24-camera world coordinate system and the outside camera coordinate system is calculated from

x_c = R x_w + T,

where x_w is a point in the 24-camera world coordinate system, x_c is the same point in the outside camera coordinate system, R is the rotation matrix, and T is the translation vector. A constrained optimization technique is used that maintains the orthonormality of the rotation matrix R.
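As a concrete illustration of this step (a sketch, not the exact constrained optimization of [9]), the extrinsic parameters can be estimated from such 2D–3D correspondences with OpenCV's solvePnP, given the fixed intrinsics and distortion coefficients; all coordinates and calibration values below are placeholders.

# Sketch of extrinsic registration from 2D-3D correspondences using OpenCV.
# This stands in for the constrained optimization of [9]; solvePnP also
# returns an orthonormal rotation (as a Rodrigues vector). Values are placeholders.
import cv2
import numpy as np

# 3D reference points in the 24-camera world coordinate system (metres):
# the 3 calibration-square markers plus box corners (placeholder coordinates).
object_points = np.array([
    [0.00, 0.00, 0.00],
    [0.25, 0.00, 0.00],
    [0.00, 0.00, 0.25],
    [2.10, 0.00, -3.00],
    [2.10, 0.60, -3.00],
    [-1.80, 0.00, -3.20],
], dtype=np.float64)

# Corresponding (column, row) pixel coordinates in the outside-camera image.
image_points = np.array([
    [312.0, 401.0], [355.0, 399.0], [310.0, 372.0],
    [498.0, 260.0], [497.0, 214.0], [131.0, 255.0],
], dtype=np.float64)

# Fixed intrinsics: focal lengths (f_u, f_v), principal point (c_0, r_0),
# and radial/tangential distortion coefficients (placeholders).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.12, 0.05, 0.0, 0.0, 0.0])

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)          # 3x3 orthonormal rotation matrix
T = tvec.reshape(3)                 # translation vector
print("R =\n", R, "\nT =", T)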
Segmentation
The virtual body is now similar to the real body with the same viewpoint. Next, we segment the user from the background, extract the head, and mark the virtual neck curve.
User Segmentation
We extract the body of the user from the video using Gaussian mixture models (GMMs) [10]. A simple difference matte key could be used instead, in which the matte is generated by taking the absolute value of the difference between two images (video frames), one with the item of interest present and an identical one without it. The problem is that, in the real world, this rarely works because the lighting conditions may change.
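For reference, the difference matte key dismissed above can be sketched in a few lines with OpenCV; the file names and the threshold value are assumptions.

# Sketch of a simple difference matte key (the baseline the GMM approach replaces).
# File names and the threshold value are placeholders.
import cv2

background = cv2.imread("empty_scene.png")          # frame without the user
frame = cv2.imread("frame_with_user.png")           # frame with the user present

diff = cv2.absdiff(frame, background)               # per-channel absolute difference
gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
_, matte = cv2.threshold(gray, 25, 255, cv2.THRESH_BINARY)

cv2.imwrite("difference_matte.png", matte)          # breaks down when lighting changes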
Hence, we use GMMs, in which the Gaussian distribution for a d-dimensional vector a = (a_1, a_2, \ldots, a_d)^T is defined by

\mathcal{N}(a; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(a - \mu)^T \Sigma^{-1} (a - \mu)\right), \qquad (1)

where µ is the mean and Σ is the covariance matrix of the Gaussian.
The probability of a under a mixture of K Gaussians is

p(a) = \sum_{j=1}^{K} w_j \, \mathcal{N}(a; \mu_j, \Sigma_j), \qquad (2)
where w_j is the prior probability (weight) of the j-th Gaussian. The first few frames contain only the background, and they are used to initialize the GMMs (Figure 3a shows a background frame). The GMMs are updated for each pixel, so even when the lighting conditions change, the system can still detect the foreground. Figure 3 shows the background subtraction using the GMMs.
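The adaptive per-pixel GMM of [10] is implemented in OpenCV as the MOG2 background subtractor, which can serve as a sketch of this step; the video path and parameter values below are assumptions.

# Sketch of per-pixel GMM background subtraction (Zivkovic [10]) with OpenCV.
# The video path, history length, and variance threshold are placeholder values.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=False)

cap = cv2.VideoCapture("outside_camera.avi")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each pixel's mixture is updated on every frame, so gradual lighting
    # changes are absorbed into the background model.
    foreground_mask = subtractor.apply(frame)
    cv2.imshow("foreground", foreground_mask)
    if cv2.waitKey(1) == 27:        # press Esc to stop
        break
cap.release()
cv2.destroyAllWindows()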
References
[1] Anna Hilsmann, David C. Schneider, and Peter Eisert. Technical section: Realistic cloth augmentation in single view video under occlusions. Comput. Graph., 34(5):567–574, October 2010.
[2] R. Bridson, S. Marino, and R. Fedkiw. Simulation of clothing with folds and wrinkles. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '03, pages 28–36, Aire-la-Ville, Switzerland, 2003.
[3] Stefan Hauswiesner, Matthias Straka, and Gerhard Reitmayr. Free viewpoint virtual try-on with commodity depth cameras, 2011.
[4] D. Bradley, G. Roth, and P. Bose. Augmented clothing. In Graphics Interface, 2005.
[5] D. Bradley, G. Roth, and P. Bose. Augmented reality on cloth with realistic illumination. Machine Vision and Applications, September 2007.
[6] Jong-Seung Park, Mee Sung, and Sung-Ryul Noh. Virtual object placement in video for augmented reality. In Yo-Sung Ho and Hyoung Kim, editors, Advances in Multimedia Information Processing – PCM 2005, volume 3767 of Lecture Notes in Computer Science, pages 13–24. Springer Berlin / Heidelberg, 2005.
[7] A. Hilsmann and P. Eisert. Realistic cloth augmentation in single view video. In Proc. of Vision, Modeling, and Visualization Workshop 2009, pages 55–62, 2009.
[8] A. Hilsmann and P. Eisert. Joint estimation of deformable motion and photometric parameters in single view videos. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, October 2009.
[9] David E. Breen, Eric Rose, and Ross T. Whitaker. Interactive occlusion and collision of real and virtual objects in augmented reality, 1995.
[10] Zoran Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), Volume 2, pages 28–31, Washington, DC, USA, 2004.
[11] Yuri Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), volume 1, pages 105–112, 2001.
[12] Yin Li, Jian Sun, Chi-Keung Tang, and Heung-Yeung Shum. Lazy snapping. In ACM SIGGRAPH 2004 Papers, SIGGRAPH '04, pages 303–308, New York, NY, USA, 2004.
[13] Luc Vincent and Pierre Soille. Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell., 13(6):583–598, June 1991.
[14] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell., 26(9):1124–1137, September 2004.
[15] Jian Sun, Jiaya Jia, Chi-Keung Tang, and Heung-Yeung Shum. Poisson matting. ACM Transactions on Graphics, 23:315–321, 2004.
[16] X. Xiao and L. Ma. Color transfer in correlated color space. In Proceedings of the 2006 ACM International Conference, 2006.
[17] A. Criminisi, P. Perez, and K. Toyama. Object removal by exemplar-based inpainting. In Computer Vision and Pattern Recognition, pages 721–728, June 2003.