Clothes Make the Model

Amazon's Outfit-Viton generates apparel images on demand.

Published

Jul 01, 2020

In online retailing, the most common customer complaints are slow shipping and the inability to try on clothes. Amazon conceived its Prime program to address the first concern. To answer the second, it built a virtual fitting room. (This is one of three recent papers from Amazon that explore AI in online retail. We’ll cover the others in upcoming weeks.)

What’s new: Amazon researchers led by Assaf Neuberger developed Outfit-Viton, a model that generates images of a user wearing any combination of apparel. Their work builds on the earlier Virtual Try-On Network and Characteristic Preserving Virtual Try-On Network.

Key insight: Previous approaches to generating images of a customer wearing a particular outfit often require hard-to-acquire data: say, 3D scans of the person and the clothes, or photos of the clothes both on and off a wearer. Outfit-Viton takes advantage of style transfer, opening the door to more training data and a more interactive user experience.

How it works: Outfit-Viton starts with a photo of the user and photos of clothing items. The network predicts the shape of each clothing item on the user and uses the predicted shape to generate an image of the entire outfit. Then it refines the image to capture greater detail (appearance refinement).

The researchers trained the system on 47,000 images of people wearing various outfits, along with images of the items in each outfit from Amazon’s catalogue. A training example consisted of an image of a person wearing an outfit (the output) and catalogue images of items in the outfit plus an image of the same person wearing a different outfit (the input).
Given a photo of the user, Outfit-Viton uses DensePose to estimate the user’s body pose and shape. Given a photo of an outfit and user input such as “shirt,” the system segments that garment. A GAN predicts the shape of the user’s body wearing the garment. (See “shape generation” in the diagram above.)
Outfit-Viton uses an autoencoder to extract features of each garment such as fabric color and pattern. It provides these features to another GAN, which predicts an initial, low-detail image of the outfit on the user’s body. (“Appearance generation” above.)
During inference, a third GAN adds further detail. This GAN’s parameters are reset for each garment and trained to reproduce that item only. This garment-specific training adds greater detail than GANs typically produce.
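
The data flow through the three stages can be sketched in a few lines of NumPy. This is a toy under loud assumptions, not the authors' implementation: every function is a hypothetical stand-in (the real system uses DensePose, an autoencoder, and three GANs), the "image" is a 4x4 grid, a garment's appearance is reduced to its mean color, and refinement is a few gradient steps on a per-garment color offset. It is meant only to show which stage consumes which output, and that the refinement parameters start fresh for each garment.

```python
import numpy as np

SHIRT, PANTS = 1, 2  # hypothetical segment labels; 0 means background


def encode_appearance(photo, mask):
    """Stand-in for the autoencoder: summarize a garment as its mean RGB color."""
    return photo[mask].mean(axis=0)


def generate_shape(body_layout, wanted_labels):
    """Stand-in for the shape GAN: keep the body regions the chosen garments cover."""
    return np.where(np.isin(body_layout, wanted_labels), body_layout, 0)


def generate_appearance(seg_map, features):
    """Stand-in for the appearance GAN: paint each segment with a coarse color."""
    image = np.zeros(seg_map.shape + (3,))
    for label, color in features.items():
        image[seg_map == label] = np.round(color, 1)  # rounding loses detail on purpose
    return image


def refine(image, seg_map, label, target_color, steps=200, lr=0.1):
    """Stand-in for refinement: fit a fresh color offset to one garment only."""
    region = seg_map == label
    base = image[region].mean(axis=0)
    offset = np.zeros(3)  # parameters start from scratch for every garment
    for _ in range(steps):
        grad = 2.0 * (base + offset - target_color)  # gradient of the squared error
        offset -= lr * grad
    refined = image.copy()
    refined[region] += offset
    return refined


# Toy inputs: a 4x4 "user" whose top half is torso and bottom half is legs.
body_layout = np.zeros((4, 4), dtype=int)
body_layout[:2], body_layout[2:] = SHIRT, PANTS

shirt_photo = np.full((4, 4, 3), [0.83, 0.12, 0.11])  # flat reddish garment photo
pants_photo = np.full((4, 4, 3), [0.14, 0.17, 0.52])  # flat bluish garment photo
garment_mask = np.ones((4, 4), dtype=bool)  # the whole photo is the garment

features = {
    SHIRT: encode_appearance(shirt_photo, garment_mask),
    PANTS: encode_appearance(pants_photo, garment_mask),
}
seg = generate_shape(body_layout, [SHIRT, PANTS])   # stage 1: shape
coarse = generate_appearance(seg, features)         # stage 2: low-detail image
final = coarse
for label, color in features.items():               # stage 3: per-garment refinement
    final = refine(final, seg, label, color)
```

In this toy, `coarse` holds the rounded colors and `final` recovers the original garment colors, mirroring how the refinement stage restores detail that the appearance stage loses.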

Results: On a 7,000-image test set, Outfit-Viton achieved a Fréchet Inception Distance of 20.06, a measure that correlates with human judgments of image similarity, where lower is better. CP-Viton, the state-of-the-art system for the task, achieved 16.63. Human judges preferred Outfit-Viton’s generated images over CP-Viton’s 65 percent of the time.
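
For readers unfamiliar with the metric: Fréchet Inception Distance fits a Gaussian to Inception-v3 features of real images and another to features of generated images, then measures the Fréchet distance between the two. A minimal NumPy sketch of that distance follows, assuming the feature vectors have already been extracted. Random vectors stand in for them here, and the function name `frechet_distance` is ours, not taken from the paper.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative up to rounding error.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Stand-in features: 500 samples of 8 dimensions (real FID uses 2048-dim Inception features).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
shifted = real + 2.0  # same spread, every feature mean moved by 2

same_score = frechet_distance(real, real)        # identical sets: distance near 0
shifted_score = frechet_distance(real, shifted)  # only the mean term remains: 8 * 2^2 = 32
```

Identical feature sets score approximately zero, and the score grows as the generated images' feature statistics drift from those of real images, which is why lower is better.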

Why it matters: Training CP-Viton requires photos of a garment both on and off a body. Outfit-Viton can learn from either, so it accommodates a more expansive training dataset and a wider variety of use cases.

We’re thinking: Stores must spur sales even as they enact social distancing measures. A neural network makes a very socially distant dressing room.
