Deep Fakes — Deep Learning To Create Fakes

Teja kalvakolanu
17 min read · Mar 4, 2020

Some of you might remember the viral video that terrified many people. Viewers believed it was real until MIT clarified that it was fabricated. This started a wider discussion about fake videos and their consequences; for people who had only been exposed to tweaked video clips and shallow fakes, these videos came as a shock.

A Reddit user coined the term Deep Fakes in 2017. Deep fakes have since drawn widespread attention for their use in fake news, hoaxes, financial fraud and morphing, prompting industry and government to put additional effort into limiting their misuse. Deep fakes keep evolving, and new techniques for creating them make detection very difficult and abuse easy. In this blog I will discuss the history of deep fakes, the techniques used to create and detect them in detail, and their commercial applications.

What is a DeepFake?

Deep fakes are synthetic media in which a person in an existing image or video is replaced with someone else's likeness. Unlike shallow fakes, which merely present wrong information, deep fakes also misrepresent the person delivering the information. A lot of research is going into reducing shallow fakes on the web, but we are still far from solving that problem. Meanwhile, deep fakes have emerged and pose an even greater threat to the truthfulness of what we see. Deep fakes can involve video falsification, audio falsification, or both. The Defense Advanced Research Projects Agency (DARPA), part of the United States military, is also funding research to detect deep fakes.

Evolution of Deep Fakes

Applying AI to video started well before deep fakes. UW's Synthesizing Obama, Face2Face and FakeApp produced deep fakes that are hard to detect. In fact, they are so hard to detect that even Jordan Peele warned about one of them. [video link: https://youtu.be/cQ54GDm1eL0]. One of the earliest precursors was "Video Rewrite", which used machine learning to match a person's lip movements to a new audio track. In January 2018 a desktop application called FakeApp was launched, putting the ability to create deep fakes into the hands of the masses. Later that month, people began finding deep fakes of people around them or whom they knew, circulating on Reddit and the chat app Discord. Following this revelation, numerous platforms including Twitter, Discord and Gfycat banned deep fakes and the communities built around them. Gfycat, in particular, used AI detection methods in an attempt to proactively police deep fakes.

A dedicated subreddit called deepfakes also gained popularity at the same time; Reddit waited until February 2018 to broadly classify and ban the subreddit along with any pornographic deep fake content. In April 2018, BuzzFeed published a deep fake video of Barack Obama that, unlike the MIT video, paired the fake footage with a voice that was not Obama's own. It reportedly took a single user 56 hours of training with FakeApp to create it. In June 2019, a deep fake video of Mark Zuckerberg surfaced online; Facebook declined to take it down, saying that removing it would go against its policies. After the Pentagon decided to fund research on deep fakes, lawmakers asked the intelligence community for a report on the threat they pose. The rules and policies implemented so far have brought mixed results.

Currently, there are techniques that can create deep fakes by mirroring body movements (UC Berkeley), transferring facial expressions with deep video portraits (Stanford University), erasing objects from an existing video (Adobe) and generating artificial voices from audio samples of real people (Lyrebird).

How are DeepFakes created?

Before going into the concepts behind the creation of deep fakes, note that creating them involves both video and audio modification. This can be done by:

  • Superimposing someone’s face onto a body so it looks like they did something they never did. [1]
  • Taking a speech and manipulating it onto an unsuspecting person’s face so it looks like they said it. [7]

Preprocessing

Before training, we need to prepare thousands of images of both persons. A shortcut is to use a face detection library to scrape facial pictures from their videos (a minimal sketch follows the list below). Spend significant time improving the quality of these facial pictures; it significantly impacts the final result.

  1. Remove any frames that contain more than one person.
  2. Make sure you have an abundance of video footage, and extract facial pictures covering different poses, face angles and facial expressions.
  3. Remove any facial pictures that are low quality, tinted, small, badly lit or occluded.
  4. Some resemblance between the two persons, such as a similar face shape, may help.
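As a concrete example of the shortcut mentioned above, here is a minimal sketch that scrapes single-face crops from a video with OpenCV's bundled Haar-cascade detector. The paths, frame stride and minimum face size are placeholder assumptions.

```python
# Sketch: scrape face crops from a video with OpenCV's Haar-cascade detector.
# Paths, stride, and minimum crop size are placeholder choices.
import os
import cv2

def extract_faces(video_path, out_dir, stride=5, min_size=80):
    os.makedirs(out_dir, exist_ok=True)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    frame_idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5,
                                              minSize=(min_size, min_size))
            # Keep only frames with exactly one face (rule 1 above).
            if len(faces) == 1:
                x, y, w, h = faces[0]
                crop = frame[y:y + h, x:x + w]
                cv2.imwrite(os.path.join(out_dir, f"face_{saved:06d}.jpg"), crop)
                saved += 1
        frame_idx += 1
    cap.release()
    return saved

# Example usage: extract_faces("person_a_interview.mp4", "data/person_a")
```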

Superimposing

Superimposing means overlaying one image over another. First, the two faces are extracted from the images. Superimposing face A onto face B involves extracting the features, encoding them and decoding them into the target image. Convolutional neural networks are used to extract the most important features, since there are millions of possible features and not all of them can be used.

This network is trained to minimize the difference between the decoded image and the target image, so that the output gets as close to the target as possible. This training requires a lot of time and effort; GANs are used for it. The loss function used is a perceptual loss, which computes the difference between the features of the input and target images and tries to reduce it. Loss functions such as mean squared error compute a loss for each pixel and therefore cannot capture losses in image quality or texture. Using a feature (perceptual) loss overcomes this and reproduces the features of the input image in the target image with minimal loss.
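To make the perceptual-loss idea concrete, here is a minimal sketch using a frozen, pretrained VGG-16 from torchvision as the feature extractor. The cut-off layer, the ImageNet normalization constants and the plain MSE over feature maps are illustrative assumptions, not the exact loss used in any particular deepfake tool.

```python
# Sketch: perceptual (feature) loss using a frozen, pretrained VGG-16.
# The cut-off layer (index 16, roughly relu3_3) and the ImageNet
# normalization are common choices, not a prescription from the papers.
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    def __init__(self, layer_index=16):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features[:layer_index].eval()
        for p in vgg.parameters():
            p.requires_grad = False          # frozen feature extractor
        self.vgg = vgg
        self.mse = nn.MSELoss()
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, decoded, target):
        decoded = (decoded - self.mean) / self.std
        target = (target - self.mean) / self.std
        # Compare feature maps instead of raw pixels.
        return self.mse(self.vgg(decoded), self.vgg(target))

# loss = PerceptualLoss()(decoded_batch, target_batch)  # both in [0, 1], shape (B, 3, H, W)
```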

Src: @University of Washington

After training the network for a long time to minimize the loss of information during encoding and decoding, we encode the features extracted from both images; decoding is then done by passing the encoded features of image 1 through the decoder of image 2, and vice versa. In effect, we reconstruct the content of image 1 using the features learned for image 2. The decoded image is then superimposed onto image 2 to take on its style properties.
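A minimal sketch of the shared-encoder, per-identity-decoder setup described above, in PyTorch. The layer sizes, the 64x64 input resolution and the names decoder_a / decoder_b are illustrative assumptions.

```python
# Sketch: one shared encoder, one decoder per identity. Training pairs
# encoder+decoder_a on faces of A and encoder+decoder_b on faces of B;
# the swap decodes A's encoding with B's decoder.
import torch
import torch.nn as nn

def conv(i, o):    # downsampling block
    return nn.Sequential(nn.Conv2d(i, o, 4, stride=2, padding=1), nn.LeakyReLU(0.1))

def deconv(i, o):  # upsampling block
    return nn.Sequential(nn.ConvTranspose2d(i, o, 4, stride=2, padding=1), nn.ReLU())

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv(3, 32), conv(32, 64), conv(64, 128), conv(128, 256))
    def forward(self, x):          # (B, 3, 64, 64) -> (B, 256, 4, 4)
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(deconv(256, 128), deconv(128, 64), deconv(64, 32),
                                 nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
                                 nn.Sigmoid())
    def forward(self, z):          # (B, 256, 4, 4) -> (B, 3, 64, 64)
        return self.net(z)

encoder, decoder_a, decoder_b = Encoder(), Decoder(), Decoder()

# Training: reconstruct each person with their own decoder.
#   recon_a = decoder_a(encoder(faces_a)); recon_b = decoder_b(encoder(faces_b))
# Swap at inference: render face A through B's decoder.
#   fake_b = decoder_b(encoder(faces_a))
```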

Src: @UW

Training

The type of network used for training is the Generative Adversarial Network (GAN). A GAN has a discriminator that distinguishes whether facial images are original or computer-generated. When we feed real images to the discriminator, we train it to recognize real images better; when we feed generated images to it, we use its feedback to train our autoencoder to create more realistic images. We repeat this until the generated images can no longer be distinguished from the real ones.

Generative Adversarial Networks [9]

Src: @fastAI by Jeremy Howard

To train a generative adversarial network efficiently, it is important to introduce noise into the existing images, that is, to "crappify" them. The GAN then addresses the loss-function problem by bringing in another model.

Let’s assume we have the crappy image and have already created a generator. It’s not a great one, but it’s not terrible either, and it produces predictions. We also have the high-resolution image, so we can compare it to the prediction with a pixel-wise MSE.

We could also train another model, which we can call either the discriminator or the critic; the two terms mean the same thing. We build a binary classifier that takes pairs of generated and real high-resolution images and learns to tell which is which: it looks at a picture and says whether it is a real high-resolution image close to the target or a generated one that is not. A standard binary cross-entropy classifier works here. So instead of the pixel MSE loss, the loss becomes: how good are we at fooling the critic? Can we create generated images that the critic thinks are real?

That is a good plan, because if the loss function is "am I fooling the critic?", the generator will learn to create images that the critic can't tell apart from real ones. We train it like that for a few batches, but the critic isn't very good yet: the generated images are still crappy, so telling the difference is easy. After we train the generator a little more, using the critic as the loss function, it gets better at fooling the critic. Then we stop training the generator and train the critic some more on these newly generated images. Once the critic becomes better at telling generated images from originals, we go back and train the generator against the new, stronger critic. This alternating loop is sketched below.
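Here is a minimal sketch of that alternating loop in PyTorch. It assumes a `generator`, a `critic` producing a single logit, and a `loader` yielding (crappy, real) image pairs already exist; the learning rates and the small pixel-loss term are illustrative choices.

```python
# Sketch of the alternating loop described above. `generator`, `critic`,
# and `loader` (yielding crappy/high-res pairs) are assumed to exist;
# learning rates and the pixel-loss weight are illustrative.
import torch
import torch.nn.functional as F

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
c_opt = torch.optim.Adam(critic.parameters(), lr=2e-4)

for crappy, real in loader:
    # 1) Train the critic: real images -> 1, generated images -> 0.
    with torch.no_grad():
        fake = generator(crappy)
    real_logits, fake_logits = critic(real), critic(fake)
    c_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    c_opt.zero_grad()
    c_loss.backward()
    c_opt.step()

    # 2) Train the generator: its loss is "did I fool the critic?",
    #    plus a small pixel term to keep it anchored to the target.
    fake = generator(crappy)
    fake_logits = critic(fake)
    g_loss = (F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
              + 0.1 * F.mse_loss(fake, real))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```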

GANs take an extreme amount of time to train and are often very expensive, and there has not been much research into reducing that time. Jeremy Howard, the founder of fast.ai, proposed using pre-trained models as the generator and the critic, which reduces training time drastically. If the generator and critic are not pre-trained, the generator cannot produce decent images, and at the same time the critic fails at detection without extensive training against the generator. This loop between the critic and the generator continues, and it takes a long time before either starts learning anything useful for the task at hand. With pre-trained models, the generator and critic improve very quickly because they already extract good features. Since this kick-start phase accounts for most of a GAN's training time, avoiding it saves a great deal of time and compute.
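As a rough illustration of the pre-training idea (not fast.ai's exact recipe), the critic can simply be initialized from an ImageNet-pretrained backbone; the ResNet-18 choice and the single-logit head are assumptions.

```python
# Sketch: start the critic from a pretrained ImageNet backbone rather than
# random weights, so it already "knows" useful image features.
import torch.nn as nn
import torchvision.models as models

critic = models.resnet18(pretrained=True)
critic.fc = nn.Linear(critic.fc.in_features, 1)  # single real-vs-generated logit
```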

Audio mixing

In this part we will discuss the lip-sync technology developed at the University of Washington. The research replaces a video's audio with new audio and then matches the lip shape of the person in the video to that audio. The researchers chose Obama because of the large amount of publicly available video of him, most of it with a similar tone and texture. In spite of the available data, the biggest challenge is converting the audio into a time-varying image. They borrow all other facial features from a composite image and synthesize only the mouth region themselves.

Audio to sparse Mouth Shape

Audio-to-mouth-shape conversion works by mapping audio features to a mouth shape, applying texture to that shape, and adding the mouth to the composite image while borrowing the features that do not change during speech. Mapping audio features to mouth shape is split into two steps: generating MFCC audio coefficients, and mapping these coefficients to 18 PCA points that describe the outer and inner lip contours from the stock footage. The output shape depends not only on the audio but also on the previous shape, so they use a recurrent neural network with long short-term memory (LSTM). Each new input is fed into the recurrent network, which already holds the previous state, and the combination of input and previous state produces the output for the next frame. Thanks to the LSTM, the history of previous states can be arbitrarily long.
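A minimal sketch of this mapping: MFCC features extracted with librosa are fed to a unidirectional LSTM that regresses 18 mouth-shape coefficients per frame. The hidden size, MFCC settings and single-layer LSTM are illustrative assumptions, not the UW configuration.

```python
# Sketch: map MFCC audio frames to 18 mouth-shape coefficients with an LSTM.
# Hidden size, MFCC settings, and the single-layer LSTM are illustrative.
import librosa
import torch
import torch.nn as nn

def audio_to_mfcc(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    return torch.tensor(mfcc.T, dtype=torch.float32)          # (T, n_mfcc)

class MouthShapeLSTM(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_coeffs=18):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_coeffs)

    def forward(self, mfcc_seq):            # (B, T, n_mfcc)
        out, _ = self.lstm(mfcc_seq)        # hidden state carries past context
        return self.head(out)               # (B, T, 18) mouth-shape coefficients

# shapes = MouthShapeLSTM()(audio_to_mfcc("speech.wav").unsqueeze(0))
```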

Sometimes the mouth moves before the corresponding audio is heard. For example, Obama says "ohh" before the word "America", hinting that he is about to speak; his mouth is already wide open before the word is heard, so the past state and current input alone cannot determine the mouth shape properly. We need to bring some future context into the model. One option is a bidirectional RNN, which takes future context into account, but it requires far more computational resources to train. Instead, the problem can be tackled by giving a unidirectional network a short window of future context, shifting the output by a small time delay.
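A tiny sketch of the time-delay trick, assuming the mouth shapes come as a (T, 18) tensor aligned with the audio frames: the target at step t becomes the mouth shape from step t - d, so the network has already heard d frames of "future" audio before committing to a shape. The delay d is a tunable assumption.

```python
# Sketch: delay the training targets by d frames so a unidirectional LSTM
# effectively sees d frames of future audio. d is a tunable assumption.
import torch

def delayed_targets(mouth_shapes, d=5):
    # target[t] = mouth_shapes[t - d]; the first d outputs have no valid
    # target and are masked out of the loss.
    T = mouth_shapes.shape[0]
    targets = torch.zeros_like(mouth_shapes)
    targets[d:] = mouth_shapes[:T - d]
    mask = torch.zeros(T, dtype=torch.bool)
    mask[d:] = True
    return targets, mask
```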

Src: @UW

Facial texture synthesis

Synthesis is applied only to the mouth area and its surroundings, such as the chin and wrinkles, while regions like the forehead and eyes are taken directly from the composite image. The texture-synthesis algorithm is designed to satisfy two requirements: 1) a sharp, realistic appearance in each video frame, and 2) temporally smooth texture changes across frames. It works by taking the per-frame PCA mouth shape, finding a fixed number of stock frames whose mouth shapes best match it, and taking the median of the textures of the selected candidates.
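A minimal sketch of the candidate-selection step, assuming the stock-footage mouth shapes and mouth-region crops are already stacked into arrays. The choice of K and the plain (unweighted) median are simplifications of the weighted-median scheme used in the actual system.

```python
# Sketch: texture synthesis by candidate selection + per-pixel median.
# K and the plain L2 distance / unweighted median are simplifications.
import numpy as np

def synthesize_mouth_texture(target_shape, all_shapes, all_textures, k=20):
    """target_shape: (18,) mouth coefficients for the frame to synthesize.
       all_shapes:   (M, 18) coefficients of the stock-footage frames.
       all_textures: (M, H, W, 3) mouth-region crops of those frames."""
    dists = np.linalg.norm(all_shapes - target_shape, axis=1)
    candidates = np.argsort(dists)[:k]                  # k best-matching frames
    return np.median(all_textures[candidates], axis=0)  # per-pixel median texture
```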

Src: @UW

Video Retiming and Natural head position

We assume the availability of a target video, into which our synthesized mouth region will be composited. Any of Obama’s weekly presidential address videos work well as targets, for example. Since the speech in the target video is different from the source (input) speech, a naive composite can appear awkward. In particular, we’ve observed that it’s important to align audio and visual pauses; if Obama pauses his speech, but his head or eyebrows keeps moving, it looks unnatural.

To solve this problem, we use dynamic programming to re-time the target video; a simplified sketch of the dynamic program follows the list below. We look for the optimal monotonic mapping between N synthesized mouth-animation frames and M target video frames such that:

  • It prefers more motion during utterance and minimal motion during silence
  • Any target video frame may be repeated at most once but never skipped. This limits slowdowns to at most 50% and the video cannot speed up; otherwise, a noticeable jump or freeze can occur.
  • It prefers sections of the target video where slowing down would be least noticeable, i.e., not during blinking or quick-expression changes.
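A simplified sketch of that dynamic program, assuming a precomputed cost matrix `cost[i, j]` that encodes the motion/silence preferences above (its construction is not shown). The state (target frame j, repeated flag) enforces "repeated at most once, never skipped"; in practice you would also constrain the path to end at the last target frame.

```python
# Sketch: monotonic retiming as dynamic programming. cost[i, j] is a
# stand-in for the motion/silence preferences above; the (j, r) state
# enforces "each target frame repeated at most once, never skipped".
import numpy as np

def retime(cost, repeat_penalty=1.0):
    """cost: (N, M) cost of showing target frame j at synthesized frame i."""
    N, M = cost.shape
    INF = np.inf
    dp = np.full((M, 2), INF)                 # dp[j, r]: best cost ending at frame j
    back = np.zeros((N, M, 2, 2), dtype=int)  # backpointers: previous (j, r)
    dp[0, 0] = cost[0, 0]                     # must start at target frame 0
    for i in range(1, N):
        new = np.full((M, 2), INF)
        for j in range(M):
            # Advance to a new target frame j (from j-1, repeated or not).
            if j > 0:
                prev = int(np.argmin(dp[j - 1]))
                c = dp[j - 1, prev] + cost[i, j]
                if c < new[j, 0]:
                    new[j, 0] = c
                    back[i, j, 0] = (j - 1, prev)
            # Repeat target frame j once (only allowed from a non-repeated state).
            c = dp[j, 0] + cost[i, j] + repeat_penalty
            if c < new[j, 1]:
                new[j, 1] = c
                back[i, j, 1] = (j, 0)
        dp = new
    # Backtrack from the best final state.
    j, r = np.unravel_index(np.argmin(dp), dp.shape)
    mapping = [int(j)]
    for i in range(N - 1, 0, -1):
        j, r = back[i, j, r]
        mapping.append(int(j))
    return mapping[::-1]   # mapping[i] = target frame shown at synthesized frame i
```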

Compositing Into the target video

Compositing into the target video is the final step of our algorithm. By this point, we have created a lower face texture for each mouth shape corresponding to the source audio. We have also re-timed the target video to naturally fit silence or talking moments in the source audio. The key part of the composition step is to create a natural, artifact-free chin motion and jawline blending of the lower face texture into the target head. The artifacts are especially visible when watching a video. Therefore, a jaw correction approach is used that operates per frame.

Jaw correction:

Optical flow is computed between the lower face texture frame and target video frame. Next, an alpha map is created to focus only on the area of the jawline. The flow is masked by the alpha map and then used to warp the lower face texture to fit the target frame.
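A minimal sketch of that per-frame correction with OpenCV, assuming the lower-face texture and target frame are the same size and that a float alpha map `jaw_alpha` (1 near the jawline, 0 elsewhere) is already available. The Farnebäck parameters are common defaults, not the values used in the UW system.

```python
# Sketch: per-frame jaw correction with dense optical flow. `jaw_alpha` is
# an (H, W) float mask in [0, 1] focused on the jawline, assumed given.
import cv2
import numpy as np

def jaw_correct(lower_face, target_frame, jaw_alpha):
    prev = cv2.cvtColor(lower_face, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(target_frame, cv2.COLOR_BGR2GRAY)
    # Flow from the target frame to the lower-face texture gives, for each
    # target pixel, where to sample in the texture (a backward warp).
    flow = cv2.calcOpticalFlowFarneback(nxt, prev, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)
    h, w = prev.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Mask the flow with the jawline alpha map, then warp the texture.
    map_x = (grid_x + flow[..., 0] * jaw_alpha).astype(np.float32)
    map_y = (grid_y + flow[..., 1] * jaw_alpha).astype(np.float32)
    return cv2.remap(lower_face, map_x, map_y, cv2.INTER_LINEAR)
```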

Final compositing:

This step involves masking and blending. The blending is done using Laplacian pyramids in a layer-based fashion. There are four layers that are blended in the following order from front to back: 1) Lower face texture (excluding the neck), 2) torso (shirt and jacket), 3) Neck and 4) the rest. Parts 1 and 3 come from the synthesized texture, while parts 2 and 4 come from the target frame.
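A minimal sketch of Laplacian-pyramid blending for a single foreground/background pair with a soft mask (the full system blends four layers in the order listed above). The pyramid depth is an arbitrary assumption.

```python
# Sketch: Laplacian-pyramid blending of one foreground/background pair.
# fg, bg: float32 images in [0, 1]; mask: float32 in [0, 1], 1 where fg wins.
import cv2
import numpy as np

def laplacian_blend(fg, bg, mask, levels=5):
    gp_fg, gp_bg, gp_m = [fg], [bg], [mask]
    for _ in range(levels):                       # Gaussian pyramids
        gp_fg.append(cv2.pyrDown(gp_fg[-1]))
        gp_bg.append(cv2.pyrDown(gp_bg[-1]))
        gp_m.append(cv2.pyrDown(gp_m[-1]))
    blended = None
    for i in range(levels, -1, -1):               # coarse-to-fine reconstruction
        if i == levels:
            lap_fg, lap_bg = gp_fg[i], gp_bg[i]   # coarsest level uses Gaussians
        else:
            size = (gp_fg[i].shape[1], gp_fg[i].shape[0])
            lap_fg = gp_fg[i] - cv2.pyrUp(gp_fg[i + 1], dstsize=size)
            lap_bg = gp_bg[i] - cv2.pyrUp(gp_bg[i + 1], dstsize=size)
        m = gp_m[i] if gp_m[i].ndim == 3 else gp_m[i][..., None]
        layer = m * lap_fg + (1 - m) * lap_bg     # blend this band with the mask
        if blended is None:
            blended = layer
        else:
            size = (layer.shape[1], layer.shape[0])
            blended = cv2.pyrUp(blended, dstsize=size) + layer
    return np.clip(blended, 0, 1)
```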

The neck mask is the region under the chin in the synthesized mouth texture, and the mouth mask is the region above it. The chin is determined by splining face-contour landmarks estimated with the DLIB library. In some target videos where the background is easy to segment, e.g., when it is solid black, we create an additional mask for the background (via a color detector) and add it to the shirt mask, so that the second layer includes both the shirt and the background. The final texture is rendered back into the target frame's pose using the 3D shape estimated during synthesis.

Common Problems

The most common problem is flickering. It occurs because one face is imposed on another, and if the backgrounds of the two pictures differ, the mask of the imposed image does not blend in well [video]. This can be tackled by the fixes below (a minimal compositing sketch follows the list):

  • applying a Gaussian filter to further diffuse the mask boundary,
  • configuring the application to expand or contract the mask, and
  • controlling the shape of the mask.
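A minimal sketch of the first fix in the list above: blur the binary face mask with a Gaussian so the boundary feathers smoothly into the target frame. The kernel size, which controls the feather width, is a tunable assumption.

```python
# Sketch: feather the mask boundary with a Gaussian blur before
# alpha-blending the swapped face onto the target frame.
import cv2
import numpy as np

def feathered_composite(swapped_face, target_frame, mask, ksize=31):
    """mask: uint8, 255 inside the face region, 0 outside; ksize must be odd."""
    soft = cv2.GaussianBlur(mask, (ksize, ksize), 0).astype(np.float32) / 255.0
    soft = soft[..., None]                     # broadcast over color channels
    return (soft * swapped_face + (1 - soft) * target_frame).astype(np.uint8)
```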

Detecting the Deep Fakes

In this section, let's go through a few common approaches used to detect deep fakes. These techniques may become outdated soon, as deep fakes are evolving every day.

Eye blinking time

A person typically blinks every few seconds, but early deep fakes often showed no blinking for long stretches. This defect was overcome very quickly, so the technique was mainly useful during the initial phase of deep fake evolution.

Image splice detection by Berkeley [4]

They identify splice detection and splice localization: splice detection determines whether an image is original, and splice localization finds where the image was tweaked. For this they use a Siamese network that takes two random patches and checks whether they have consistent properties and metadata; in other words, they exploit the image's self-consistency.
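A toy sketch of the Siamese idea, not the actual Berkeley network: a shared CNN embeds two patches, and a small head predicts whether they share the same provenance. All layer sizes and the 64x64 patch size are assumptions.

```python
# Sketch: a toy Siamese consistency check. A shared CNN embeds two patches;
# a small head predicts whether they come from the same source.
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, dim))
    def forward(self, x):
        return self.net(x)

class ConsistencyHead(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = PatchEmbedder(dim)
        self.head = nn.Linear(2 * dim, 1)      # logit: same provenance or not
    def forward(self, patch_a, patch_b):
        za, zb = self.embed(patch_a), self.embed(patch_b)  # shared weights
        return self.head(torch.cat([za, zb], dim=1))

# logit = ConsistencyHead()(patches_a, patches_b)  # each (B, 3, 64, 64)
```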

The temporal-spatial distribution between frames

The temporal-spatial distribution between frames can reveal discontinuities between frames. Biological signals are quantitatively more descriptive for deep fake detection: it has been shown that biological features are not preserved in synthesized videos, and both their spatial and temporal properties matter. Exploiting these biological features with support vector machines (SVMs) has been shown to perform significantly better than existing deep learning models used to detect deep fakes. [10]
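As a rough illustration of the pipeline (not the FakeCatcher method itself), one could extract a crude biological-style signal, such as the mean green-channel intensity of the face region over time, summarize it with a few statistics, and classify with an SVM from scikit-learn. All feature choices here are assumptions.

```python
# Sketch: a crude stand-in for biological-signal features. Per video, take
# the mean green-channel intensity of the face region over time (a rough
# rPPG proxy), summarize it, and feed an SVM. The real FakeCatcher
# features are far richer; this only illustrates the pipeline.
import numpy as np
from sklearn.svm import SVC

def green_signal_features(face_crops):
    """face_crops: list of (H, W, 3) uint8 face regions, one per frame."""
    signal = np.array([crop[..., 1].mean() for crop in face_crops])
    diffs = np.diff(signal)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    return np.array([signal.std(), diffs.std(), spectrum[1:6].mean()])

# X: (n_videos, 3) feature matrix, y: 1 = real, 0 = deepfake (labels assumed given)
# clf = SVC(kernel="rbf").fit(X, y)
# clf.predict(green_signal_features(new_video_crops).reshape(1, -1))
```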

Defects in the deep fakes [ 5 ]

When a deep fake video synthesis algorithm generates new facial expressions, the new images don’t always match the exact positioning of the person’s head, the lighting conditions, or the distance to the camera. To make the fake faces blend into the surroundings, they have to be geometrically transformed — rotated, resized or otherwise distorted. This process leaves digital artifacts in the resulting image. You may have noticed artifacts from particularly severe transformations, such as blurry borders and artificially smooth skin, which make a photo look obviously doctored. More subtle transformations still leave evidence, and there are algorithms that can detect it even when people can’t see the differences.

Adding noise to the images [ 3 ]

Image libraries of faces are assembled by algorithms that process thousands of online photos and videos and use machine learning to detect and extract faces. A computer might look at a class photo and detect the faces of all the students and the teacher, and add just those faces to the library. When the resulting library has lots of high-quality face images, the resulting deep fake is more likely to succeed at deceiving its audience.

A specially designed noise is added to digital photographs or videos that is not visible to human eyes but can fool face detection algorithms. It can conceal the pixel patterns that face detectors use to locate a face, and create decoys that suggest there is a face where there is none, such as in a piece of the background or a square of a person’s clothing. With fewer real faces and more non-faces polluting the training data, a deep fake algorithm will be worse at generating a fake face. That not only slows down the process of making a deep fake but also makes the resulting deep fake more flawed and easier to detect.

Blockchain [ 6 ]

It is believed that blockchain could be a solution to deep fakes. It offers a new way of documenting the internet, with the ability to provide immutable, tamper-proof records of data and transactions in a decentralized distributed ledger. Startups such as Truepic and Serelay have developed systems built around mobile apps for capturing imagery and saving it to the company's servers. Truepic uploads the whole image and uses blockchain to store the metadata to ensure immutability, while Serelay computes only a unique fingerprint, which is saved on its servers. These methods are promising but rely heavily on third parties, whereas the whole point of blockchain is to decentralize and eliminate such intermediaries.

Sherlock AI [8]

Sherlock is software developed by Stanford researchers that uses a CNN to detect anomalies in video. It is reported to have achieved 97% accuracy on the deep fakes dataset released by Google, which many deep learning researchers use for work on deep fake detection; the dataset has been made open source by Google.

Commercial Applications of Deep fakes

Synthesia is a company with a commercial product that uses deep fake technology to do automated, convincing dubbing through automated facial re-animation. They shot to prominence with a video featuring David Beckham talking about malaria in nine languages, but their product could also be used to expand the reach of creators around the world. For a talented artist who isn't working in one of the world's dominant languages, access to a product like this could be career-changing, making their work viable in additional languages and countries.

Final thoughts

A race is on between techniques for generating deep fakes and techniques for detecting them: the more detection methods are created, the better deep fakes become at hiding their defects. Government and industry are taking initiatives to limit deep fakes. It is fascinating to see AI applied in so many fields, but not without a warning: deep fakes can be misused to create chaos among nation states, countries and individuals like us. Once the technology matures, it will become difficult to tell fakes from originals. Thankfully, people are becoming more aware of fake news on the net, and more rational thinkers are skeptical when they read or see something online rather than blindly believing it.

References

[1] Thanh Thi Nguyen and Cuong M. Nguyen and Dung Tien Nguyen and Duc Thanh Nguyen and Saeid Nahavandi “ Deep Learning for Deepfakes Creation and Detection” arXiv:1909.11573, 2019.

[2] Jonathan Hui. “How Deep Learning Fakes Videos and How to Detect Them.” In Medium, 2018.

[3] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. “FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces.” arXiv preprint arXiv:1803.09179, 2018.

[4] Minyoung Huh, Andrew Liu, Andrew Owens, Alexei A. Efros, “Fighting Fake News: Image Splice Detection via Learned Self-Consistency.” arXiv preprint arXiv:1805.04096, 2018.

[5] Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steven M. Seitz. “Being John Malkovich.” In European Conference on Computer Vision, pp. 341–353. Springer, 2010.

[6] Maras, M. H., and Alexandrou. Determining the authenticity of video evidence in the age of artificial intelligence and in the wake of deep fake videos. The International Journal of Evidence and Proof, 23(3), 255–262.,2019.

[7] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. “Synthesizing Obama: Learning Lip Sync from Audio.” University of Washington, ACM Transactions on Graphics (SIGGRAPH), 2017.

[8] Kristina Libby. “This Bill Hader Deepfake Video Is Amazing. It’s Also Terrifying for Our Future.” In Popular Mechanics, Aug 2019.

[9] Jeremy Howard, Hiromi. “Generative Adversarial Networks, U-Nets and ResNets.” In fast.ai, 2018.

[10] Umur Aybars Ciftci, Ilke Demir, Lijun Yin, FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals, arXiv:1901.02212v2 [cs.CV], 9 Aug 2019 .
