In Conversation with Victor Riparbelli (CEO) and Matthias Niessner (Co-Founder), Synthesia

One of the most exciting emerging areas for AI is content generation. Powered by anything from GANs to GPT-3, a new generation of tools and platforms enables the creation of highly customizable content at scale – whether text, images, audio or video – opening up a broad range of consumer and enterprise use cases.

At FirstMark, we recently announced that we had led the Series A in Synthesia, a startup providing impressive AI synthetic video generation capabilities to both creators and large enterprises.

As a follow up to our investment announcement, we had the pleasure of hosting two of Synthesia’s co-founders, Victor Riparbelli (CEO) and Matthias Niessner (co-founder and a Professor of Computer Vision at Technical University of Munich).

Some of the topics we covered:

  • The rise of Generative Adversarial Networks (GANs) in AI
  • Use cases for synthetic video in the enterprise
  • Synthetic videos vs deep fakes
  • What’s next in the space

Below is the video and below that, the transcript.

(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)

VIDEO:

TRANSCRIPT (edited for clarity and brevity)

[Matt Turck] Welcome, Victor and Matthias from Synthesia. To give everybody context, we're going to jump into a video that gives a nice preview of what Synthesia is all about.

[Synthesia video]

[1:53] The whole idea here is that you type something in, and the technology creates these avatars, which for now are based on real-life actors, but through AI you can make them say things in a bunch of different languages, with expression. So this is at the intersection of voice and computer vision. Maybe walk us through it from an AI perspective: how does that work? What is it based on?

[Matthias Niessner] [2:29] A lot of this research actually comes from computer graphics and the movie industry. When you have an actor with a stunt double, for example, you have to virtually replace or edit the faces, edit the actors. And over the decades the movie industry has made steady progress in making that easier for editors and artists, reducing the effort involved.

[2:56] In the last 10 years or so, a lot has happened in AI and deep learning. Traditional graphics methods have been augmented with AI methods, and this has become a lot easier. You now have generative AI methods, like generative adversarial networks, and these technologies help make the process even easier than it used to be. So instead of having artists manually fix the face replacements of all the actors, you now have an AI that does it all automatically.

[3:26] What is a generative adversarial network?

[3:33] The idea of a generative network is that you show a network a bunch of images of faces, and the network learns how to create new images of faces, essentially learning the distribution of the existing people it has seen. Then you can create new images that look like faces, but they're not any specific image it has observed.

[3:55] There has been a lot of work on GANs in computer vision over the last five, six, seven years. And now there's a lot of new work where you can create not just images but full videos, and make them look very, very realistic. You can create very high-resolution videos that are pretty much indistinguishable from real videos.
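
To make the adversarial idea concrete, here is a minimal training-step sketch of a GAN in PyTorch. It is purely illustrative: the tiny fully connected generator and discriminator, the 64x64 image size, and the hyperparameters are assumptions chosen for readability, not anything Synthesia or the research discussed here actually uses.

```python
# Minimal GAN sketch (illustrative only). A generator maps random noise to
# images; a discriminator learns to tell generated images from real ones.
# Training them against each other pushes the generator toward the
# distribution of the real images it has "seen".
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 64 * 64 * 3  # e.g. flattened 64x64 RGB face crops

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),   # outputs in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                    # real/fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):
    """One adversarial update. real_images: (batch, img_dim), scaled to [-1, 1]."""
    batch = real_images.size(0)
    fake_images = generator(torch.randn(batch, latent_dim))

    # Discriminator: label real images 1 and generated images 0.
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the (updated) discriminator label fakes as real.
    g_loss = bce(discriminator(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Real face and video models use far larger convolutional or transformer architectures, but the core loop of a generator and a discriminator trained against each other is the same.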

[4:25] Taking this broad world of research and what came from the movie industry, plus GANs, how did you translate it into what we see today?

[4:42] I was always super excited about video games and movies. I was very inspired by science fiction, like Star Trek, where you had a holodeck and could create virtual environments that looked like real environments. So the high-level idea is that you want to create holograms. And that's a really big challenge, of course: not just capturing a 2D image, but building this kind of 3D hologram that you can later animate.

[5:06] So a lot of the research actually comes from 3D vision: 3D reconstruction of people, 3D tracking of people, 3D face recognition, 3D face tracking, getting 3D models of faces. Traditionally, this has been done with standard 3D models. Now, combined with the AI technologies I mentioned, deep learning, GANs, and so on, it has become a lot more feasible, and a lot of new opportunities have opened up. We're actually getting close to making this dream of holograms a reality. We're not quite at holograms, but at least we can create pretty realistic videos.

[5:44] Victor, how did you guys connect and start the company? Maybe walk us back to the origins of the company and how the four co-founders, so, you, Steffen [Tjerrild, COO & CFO], Matthias and Lourdes [Agapito, Professor of 3D Computer Vision at University College London], connected.

[Victor Riparbelli] [6:00] I'll give you the short version. Like most founding stories, it's a long, complicated story. Steffen and I had worked together back in Denmark at a venture studio, almost 10 years ago now. We had great energy with each other, and I think we had the same level of ambition. So we had a good partnership going, but we decided to go down two different paths. Steffen went to Zambia to work in private equity, and I went to London, because I had figured out that I loved building things, but I was more passionate about science-fiction technology than I was about building accounting programs or business-tool type software.

So I went to London and started working on AR and VR technologies, which I'm still very excited about, but I think the market is still growing, let's put it that way. Through that work with AR and VR technologies, I met Matthias. I had also spent some time at Stanford, and Matthias had spent time at Stanford. I started looking at these technologies; "Face2Face" is probably Matthias' most well-known paper in the space of what we're doing. And I got very interested in how these AR and VR technologies, 3D computer vision, and deep learning had hit an inflection point, where they went from being super cool to also being super useful. We saw that with things like Oculus, for example, which in large part is driven by advances in these fields.

[7:29] And I got very excited about the idea of applying that to video, because video was already a massive market, the video economy is growing, and you don't have to convince anyone that video is going to be big. When I saw that paper, it was a glimpse into the future, and that's how we started talking. Professor Lourdes Agapito from UCL was also someone I had been involved with quite a lot. We just got really excited about this idea of creating technology that would make it easy for everyone to create video, without having to deal with cameras, actors, and studio equipment every time you wanted to make a video. So that's how it all came together.

[8:04] We saw a glimpse of it in the video at the beginning of this conversation, but can you elaborate on what the product does? So, you took those broad, impressive capabilities in AI, and turned them into an actual software product. What does it do?

[8:24] Synthesia is building the world's largest platform for video generation. On our platform, you can essentially create a real, professional video directly from your browser. We have a web application, which you had a glimpse of before, that is very simple to use. You go in and select an actor, which could be one of the stock actors built into the platform, or you can upload yourself with three to four minutes of video footage. You then create videos by simply typing in text. That's the fundamental idea: you type in the text of the video you're creating, you can use our editor to add images and text, a sort of PowerPoint style of creation, you hit "generate", and in a couple of minutes your video is ready.

[9:05] Creating these video assets has gone from a super unscalable process of dealing with cameras, actors, studios, and expensive equipment, producing assets that, once recorded, can't really be changed, to something you can now do basically as a desk job.

[9:21] That's the core product, and it's used by two distinct groups of customers: we work a lot with the Fortune 1000, and we work a lot with individual contributors, or individual creators. The core idea here, and I think this is something people sometimes mistake from the outside, is that our platform is not really a replacement for traditional video production with cameras as you know it. It's actually more a replacement for text. That's the big thing here. What our customers are using our platform for is their communications, primarily learning and training right now. They're still creating hero shots for the most important pieces of content, but for all the long-tail content that lives as text, they can now make videos.

[10:05] So imagine you're a warehouse worker at a very, very large technology company, for example, and you have to be trained on COVID guidelines or get a company update. For the vast majority of people, video is a much better medium to communicate with than a five-page PDF. So that's the core use case of our platform today.

[10:30] So the use cases are mostly B2B enterprise, and your platform targets large corporations around the world. Just to double-click on this, some of the use cases are learning and training, where people need to learn to do their jobs, and also onboarding, is that right, across different industries?

[10:52] From a general perspective, video is a much, much more effective medium to communicate with, even more so in a remote world. What we're seeing our customers use [our platform] for is all the things that traditionally would be text: onboarding documents, manuals, training, and learning. Those are the ones we're working with primarily right now, but we're also slowly starting to see the first use cases with external content, like marketing, where you also want video assets instead of text assets.

[11:26] And then there's a very interesting cross-section of those two, which I'll call customer experience; we see that a lot. Let's say you're a bank, for example. You have FAQ and help desk articles, and you have a lot of them because you have a very complex product. Most users are simply not going to read through a long page of text explaining how insurance works or how a credit check works. They're now starting to use videos generated on our platform to communicate those kinds of things as well.

[11:53] What we're also slowly starting to see is not just turning text into video, but data into video. The big idea Synthesia is built around is that video production, and media production in general, is going to go from something we record with cameras and microphones to something we code with computers. Once all of this production layer has been abstracted away as software, we can do a lot of new things with video that we couldn't do before. We can take data about a particular customer, for example, and make a video that speaks to them. At the gimmick level, of course, that means the video says your name; at the much more interesting level, if it's a bank, for example, you can take data about how much money you have left in your account and what you spent your money on last month, and build short, interactive videos that are much more effective at communicating with customers.

[12:42] And it’s now provided as an API.

[12:47] Exactly. I think we have a very strong belief that what we're building here is a foundational technology that's going to change how we communicate online. Right now, we're doing a lot of this training work, but as I mentioned before, the really interesting idea here is that media production will become code. And once it's code, we have all the benefits of working with software: it scales infinitely, it has more or less zero marginal cost, and we can make it available to everyone.

[13:15] And the API is super interesting because it means you can take any experience online that would usually be static or text, and make it interactive and video-driven. Linear video is one element of what we're doing right now, but the API part, for which we're launching V1 of the platform in a couple of weeks, will open up a whole new space of opportunities, most of which probably haven't been thought of yet. What's so exciting about this space is that, moving forward, it will be as easy to create a video-driven website as a text-driven website. And I think that's going to have major implications for how we go about the user experience online.
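
To make the "media production as code" idea a bit more tangible, here is a purely hypothetical sketch of generating a personalized video from customer data via an API. The endpoint URL, field names, avatar identifier, and response shape are all invented for illustration; this is not Synthesia's actual API.

```python
# Hypothetical "media production as code" sketch (not a real API).
# Structured customer data is turned into a short script and posted to an
# imaginary text-to-video endpoint that returns a URL for the rendered video.
import requests

API_URL = "https://api.example-video-platform.com/v1/videos"  # placeholder URL
API_KEY = "YOUR_API_KEY"                                       # placeholder credential

def personalized_summary_video(customer: dict) -> str:
    # Build the script from data, e.g. pulled from a bank's systems.
    script = (
        f"Hi {customer['name']}, here is your monthly summary. "
        f"You spent {customer['spent']} this month, "
        f"and your remaining balance is {customer['balance']}."
    )
    payload = {
        "avatar": "stock_presenter_01",  # hypothetical stock-avatar identifier
        "script": script,                # the text that drives the avatar
        "language": "en",
    }
    response = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["video_url"]  # hypothetical response field

# Example (with made-up data):
# personalized_summary_video({"name": "Alex", "spent": "$1,240", "balance": "$3,560"})
```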

[13:56] For now, the avatars, the speaking figures, are based on real-life actors, and you have a whole library of people. The idea is then to create synthetic avatars, so each company could have its own sort of spokesperson-type figure. Is that the path?

[14:18] There are two streams there. One, which is already used by roughly 80% of our enterprise clients today, is that you can create a real avatar of yourself, or of someone from the leadership or management team, for example, or of someone who's a brand representative. This process is something we're working on scaling a lot right now. It's quite an easy process today, but we definitely believe we're all going to have a digital representation of ourselves, a kind of avatar we can use for creating video, or maybe even Zoom calls in the future.

[14:52] So that’s one stream, it’s taking you and making a digital version of you with your voice. You can speak any language, you can do PowerPoint presentations live, you could maybe even do these conferences one day by just typing in text.

[15:03] And the other stream is what we think of as synthetic humans. Some people might have seen the MetaHumans approach from Unreal that came out recently. This is where you go in, a bit like when you start a computer game and create a character: you can brand that character with a logo, put a hat on it if it's a fast food chain, or whatever you want to do. Then you can create these artificial characters that represent your brand. That's also quite interesting, because it allows for a whole new level of diversity, with different types of people representing your brand rather than just one face of the company, which is generally the case today.

[15:40] This feels like a good place to address the obvious question around deep fakes, which, to be precise on the definition, means taking somebody's appearance without their consent and creating video content that makes them say something they never said. Can you walk us through how you think about this, both from a technical and definitional standpoint, and also more generally from an ethical perspective?

[16:19] Yeah, sure. This is obviously a really powerful technology, and I'm sure everyone in this audience is familiar with the concept of deep fakes as you just explained it. I think that with all new technologies, especially very powerful ones, they pop into the world and we're immediately very scared of them. The fear is definitely real; these technologies will be used for bad, for sure.

[16:44] For us, there are a few guiding principles around how we can minimize the harmful effects of these technologies. One is inside of Synthesia: ensuring that our tech is not used for bad. That's relatively easy to do because our experience is fully on rails, and we're building identity verification and things like that. Outside of our platform and what we're working on, I think there are a few interesting things to mention. The first is education. We've been able to forge text, images, and other forms of media for the last 30 years, and while people definitely still get fooled by text or images, we all have some kind of embedded understanding that not everything you read online is necessarily true.

[17:27] We need to do the same for video. One part of this is obviously through some kind of media moment, and we've done a lot with David Beckham and Lionel Messi, for example, experiences that reach a really broad audience and get the world talking about these things. But I actually also think that exposure to this type of media is the most important part of it. Once you start getting personalized birthday messages from David Beckham or Lionel Messi, for example, and you know that's not real, that builds that embedded sense of this new online world.

[18:00] And then the last one is technology solutions. This is obviously a very, very big topic. The first one people latched onto is, "Let's build deep fake detectors." Both Synthesia as a company and Matthias are working with a lot of the larger companies, sharing data and helping them with their deep fake detection tools. And I think that might remove the bulk of the content that might be created.

[18:25] What I'm more excited about in the long term is building a provenance system, which is less about deep fakes in particular and more about how we can establish the provenance of media content and how it comes out online. A real example of this could be: you're the BBC, you upload something to the internet, and the first thing you do is register it in a central database somewhere. I kind of hate the word, but this might be something a blockchain actually could be good for.

[18:54] And then you have a system on YouTube and the other major platforms that, every time someone uploads a piece of video, scans it and checks whether this video, or something very close to it, has been uploaded before. Then we could build this chain of provenance. I've usually explained it as Shazam for video content, like the app that can listen to a song and tell you what it is. If you could build a similar system for video content, I think that would go a long way toward helping users understand where content came from and how it has been manipulated along the way.
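
For readers curious what such a "Shazam for video" lookup could look like in practice, here is a minimal, hypothetical sketch: it samples frames from a clip with OpenCV, computes perceptual hashes with the imagehash library, and checks new uploads against an in-memory registry of previously registered fingerprints. The registry, sampling rate, and matching threshold are illustrative assumptions, not how YouTube or any provenance standard actually works.

```python
# Hypothetical near-duplicate video lookup via perceptual hashing (illustrative).
# Requires: opencv-python (cv2), Pillow, imagehash.
import cv2
import imagehash
from PIL import Image

def video_fingerprint(path, every_n_frames=30):
    """Sample frames and hash them to build a rough fingerprint of the video."""
    hashes = []
    cap = cv2.VideoCapture(path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            hashes.append(imagehash.phash(Image.fromarray(rgb)))
        frame_idx += 1
    cap.release()
    return hashes

registry = {}  # toy in-memory registry: publisher -> list of fingerprints

def register(publisher, path):
    """A publisher (say, a broadcaster) registers a clip at upload time."""
    registry.setdefault(publisher, []).append(video_fingerprint(path))

def find_match(path, max_distance=8, min_overlap=0.8):
    """Check whether an uploaded clip is close to anything already registered."""
    candidate = video_fingerprint(path)
    for publisher, fingerprints in registry.items():
        for fp in fingerprints:
            # Count frame pairs whose Hamming distance is small enough to be
            # considered the same or a near-duplicate frame.
            close = sum(1 for a, b in zip(candidate, fp) if a - b <= max_distance)
            if fp and close / len(fp) >= min_overlap:
                return publisher
    return None
```

A production system would also need robustness to re-encoding, cropping, and editing, plus a signed record of who registered what and when, which is where the provenance database Victor describes comes in.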

[19:24] One of the questions that came up in the chat, and maybe this is for either of you, is around languages: how many languages does Synthesia support, and how does that work behind the scenes?

[19:38] Yes, we support 55 languages right now. That's a feature that I think more or less all of our clients are using. The way it works is that we have a very broad selection of text-to-speech voices, which drive the avatars. That's how it works on the backend: the core technology that drives our avatars can take an audio signal and turn it into a video.

[20:01] From a user experience perspective, it's quite easy: there's a text box where you type in the script. If you type it in English, the video comes out in English. If you type it in French, it comes out in French. If you type it in Italian, it comes out in Italian. That's how it works right now. But the very interesting thing about synthetic media, to me, is that we're building this video synthesis technology, which solves a very hard problem, and there are all these other technologies that are complementary and will work as force multipliers in this space. Everybody knows GPT-3, of course; what's that going to mean for translation? What's that going to mean for automatic content creation? That's how it works right now, but in the future I definitely foresee machine translation becoming 10x better than it is today, and then it'll be really interesting.
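
Here is a minimal sketch of that "type in any language, get the right voice" behavior. The language detection uses the langdetect library; the voice table and the synthesize_speech / render_avatar_video functions are hypothetical stand-ins for a TTS engine and the video-synthesis step, not Synthesia's backend.

```python
# Minimal text -> speech -> video pipeline sketch (stand-in functions only).
from langdetect import detect  # real library: detects e.g. "en", "fr", "it"

# Illustrative mapping from detected language code to a TTS voice name.
TTS_VOICES = {
    "en": "english_voice_1",
    "fr": "french_voice_1",
    "it": "italian_voice_1",
}

def synthesize_speech(script: str, voice: str) -> bytes:
    """Placeholder for a real TTS engine; returns fake audio bytes."""
    return f"[{voice}] {script}".encode("utf-8")

def render_avatar_video(avatar_id: str, audio: bytes) -> str:
    """Placeholder for the video-synthesis step that the audio would drive."""
    return f"/tmp/{avatar_id}_{len(audio)}_bytes.mp4"

def script_to_video(script: str, avatar_id: str) -> str:
    lang = detect(script)                            # "fr" for a French script
    voice = TTS_VOICES.get(lang, TTS_VOICES["en"])   # fall back to English
    audio = synthesize_speech(script, voice=voice)   # text-to-speech
    return render_avatar_video(avatar_id, audio)     # audio drives the avatar

# Example: script_to_video("Bonjour, bienvenue dans la formation.", "avatar_01")
```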

[20:48] Matthias, what do you think is next for the space in terms of capabilities? What is AI going to be able to do in three or four years that it can't do now, or can't do well?

[21:03] Yeah, I'm very excited about the space, both as a researcher and as an entrepreneur; it's super interesting. AI has, in a sense, already become more than a tool. From an educational perspective, AI is becoming like the basic math you learned in high school: basically everybody in computer science now learns basic AI. With that knowledge being adopted by people and universities, there will be a lot of progress on the research side and also on the startup side. You're not unique anymore when you use AI in a company at this point; you basically have to use it just to compete with the massive scale of data and these kinds of things.

[21:48] In terms of where I believe the technology will actually go: right now, if you look at this Zoom call, it's still relatively limited. From a pure communication and interaction perspective, it's super limited. You don't have full 3D views; it's far from the kind of immersive communication you're used to in real life. So that's a big challenge: how to make communication better, in a sense addressing mobility, with virtual communication technologies.

[22:21] The second aspect, on the much broader AI side, is language. Synthesia, for instance, still uses text as input to generate videos. What's still pretty difficult is automatically generating that text, or figuring out how to have an automated avatar that could interactively reply and answer questions reasonably. There have been a lot of efforts, but it's still very basic. These things are just not at the level where you can have customer service fully replaced by AI. People try that, but the experiences are not quite there yet. And there I see a lot of potential in the next few years.

[23:02] For us as a company, I think there's tremendous potential along these lines too. At the moment you're basically taking text as input and creating a portrait avatar that talks to you. But in the future we'll have many more capabilities: maybe multiple people interacting with each other, possibly in real time, with more emotion in it. And in the long run, what I see is that you can basically generate a whole movie from just a book. So you're helping Hollywood, in a sense, create fully featured blockbuster films just from some text.

[23:40] That's the high-level vision, and I'm still a big fan of science fiction. I would love to see holograms become reality. I know we still need some version of a display, augmented reality or VR devices, to do that, but on the pure image and video generation side, I would bet on the next three, four, or five years; the progress is even accelerating. If you're following the research community, even in the last year there have been so many cool papers coming out from all kinds of different groups around the world. If you're in this space, you can count yourself very lucky, I think.

[24:15] Let's take one last question from the audience before we switch over. Matthias, you kindly answered a couple in the chat, and people can take a look at those. A practical question from Alan: "Checking your website, can we load our own images and videos to create bespoke avatars of colleagues or clients to create personal videos?" You alluded to some of this.

[24:48] Yeah, absolutely. We already have, I think, close to 200 custom avatars on the platform so far, and more or less all of our corporate clients are using them for exactly this purpose. The onboarding process requires roughly three to four minutes of footage, and we're working on getting that down to just a single image. It's a one-off process; once it's done, you can create videos of yourself on the platform, and you can use our API and our Zapier integration, for example, to very easily create personalized videos for clients, colleagues, or employees. That is definitely the core use case right now.

[25:25] Great, well, that feels like a great place to finish. Thank you so much. Obviously, as an investor, I'm incredibly excited about what you guys are doing, but I think for everyone this is an absolutely fascinating glimpse into the future, and very much the present as well. It feels like a clear case of "the future is already here, it's just not evenly distributed." This is one of those moments where you see something that is going to be very prevalent but is just getting started. It's wonderful.
