The fastest trackers in video searches

Every day, 80,000 hours of video material are uploaded to the video platform YouTube alone, corresponding to a period of nine years. Multimedia researchers around the world are working to improve searches in video pools like these, which are common to many media organisations, among others. One of these researchers is Klaus Schöffmann, who created the Video Browser Showdown for this purpose in 2012. Today, the world’s leading scholars take part in this competition and present their latest approaches to searching for videos.

“Teams of researchers compete against each other in a contest. The aim of the teams is to find a specific sequence in a large pool of videos as quickly as possible.” (Klaus Schöffmann)

The year was 2012, and Klaus Schöffmann and his colleagues were planning to organise the MMM conference in Klagenfurt. Organising this “International Conference on Multimedia Modeling” in a way that broke even required creativity: Schöffmann recalled a competition (VideOlympics) that he had experienced at a conference and which had generated a lot of enthusiasm. As the number of registrations was still rather low, everybody put their heads together and came up with the idea of the Video Browser Showdown. The concept is simple to explain: “Teams of researchers compete against each other in a contest. The aim of the teams is to find a specific sequence in a large pool of videos as quickly as possible,” Klaus Schöffmann explains.

In the summer of 2022, attendees celebrated the eleventh edition of the Video Browser Showdown at the MMM conference in Phu Quoc, Vietnam. The format has grown into a model of success: While the first competitions involved searching individual videos lasting 60 to 90 minutes, the pool has now reached 2,300 hours. Year after year, the top researchers in the field prepare meticulously for the showdown, applying for dedicated research projects and competing live at the conference venue or in hybrid event formats.

But what is so complicated about finding a video sequence in a large pool of footage? On the one hand, as Klaus Schöffmann points out, the sheer volume of data to be analysed is a critical factor. In this year’s competition, researchers had to search a total of 2.5 million segments. At these volumes, it is no longer possible for anyone to sift through all the material ‘manually’. Another challenge lies in the method used to tackle the competition. “Typically, if you take an approach that the others don’t have at their disposal, you prevail.” Moreover, despite state-of-the-art AI-based image analysis, it remains a major challenge to accurately identify all of the essential content in videos.

Participants in the Video Browser Showdown are given two types of task: In one format, the jury shows the contestants a short sequence lasting twenty seconds, which they then have to find in the pool. In the other, even harder format, they are shown only a short text. When it comes to finding scenes in which ‘food is being prepared’ or ‘vegetables are being cut’, an image must first form in the researchers’ minds, which they then have to track down in the vast video pool.

Whoever believes that searches such as these function automatically and without any interaction with a human being is disabused of this notion when talking to Klaus Schöffmann: “We need a user to collaborate in the interactive search, i.e. to approach the goal gradually by answering questions or by selecting certain sequences. This kind of search is not yet fully autonomous.”
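This kind of interactive loop can be illustrated with a small sketch. It is a toy example only, not one of the actual Video Browser Showdown systems: shots are ranked by similarity to a query vector, and each time the user marks a result as relevant, the query is nudged toward it (classic Rocchio-style relevance feedback). The shot names, three-dimensional ‘features’ and weighting values are invented for the illustration.

```python
# Toy sketch of interactive search with relevance feedback:
# the user refines the ranking step by step by marking relevant shots.
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, shots):
    # shots: {shot_id: feature_vector}; best match first.
    return sorted(shots, key=lambda s: cosine(query, shots[s]), reverse=True)

def refine(query, shots, relevant_ids, alpha=1.0, beta=0.75):
    # Rocchio-style update: move the query toward the centroid
    # of the shots the user marked as relevant.
    centroid = [sum(shots[s][i] for s in relevant_ids) / len(relevant_ids)
                for i in range(len(query))]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]

# Invented example data: three shots with made-up 3-dimensional features.
shots = {
    "shot_a": [0.9, 0.1, 0.0],
    "shot_b": [0.8, 0.2, 0.1],
    "shot_c": [0.0, 0.9, 0.4],
}
query = [1.0, 0.0, 0.0]
print(rank(query, shots))  # -> ['shot_a', 'shot_b', 'shot_c']

# The user marks shot_c as the kind of scene they are after;
# the weights here are chosen purely for the toy example.
query = refine(query, shots, ["shot_c"], alpha=0.3, beta=1.0)
print(rank(query, shots))  # -> ['shot_c', 'shot_b', 'shot_a']
```

One round of feedback is enough to reorder the toy ranking; real systems iterate this loop many times, which is exactly the human-in-the-loop collaboration Schöffmann describes.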

What may appear like a game between academics has many practical applications and – considering the ever-growing video pools – is also urgently needed. Klaus Schöffmann points to media organisations that often have an enormous stock of data. For these companies, improved video search tools are worth their weight in gold. Videos can also be used for many other practical applications: The team headed by Klaus Schöffmann is researching, for instance, how to improve the searchability of images taken during endoscopic operations. A similar competition to the Video Browser Showdown also exists in the area of supporting people with impaired health: The Lifelog Search Challenge (LSC) is designed to help them remember specific episodes from their lives, which can be helpful, for example, if they are no longer sure whether they have taken their medication.

Whoever wins the Video Browser Showdown nowadays ranks among the world leaders in this field of research. Teams from 21 countries have already taken part. Large research groups have the advantage that they can address the issues in doctoral theses – thereby achieving the best results. “One team from Switzerland has as many as four PhD students who are specifically focused on the tasks involved in the competition”, Klaus Schöffmann tells us. Altogether, more than 20 academic theses (including 15 doctoral theses) have already been written in connection with the Showdown.

“We only won at the inaugural Video Browser Showdown in 2012, but we’ve repeatedly made it onto the podium. As a relatively small team, we’re very proud of that.” (Klaus Schöffmann)

The main prize for the scientists is fame and honour, plus the prize money. We ask Klaus Schöffmann whether the research group at the Department of Information Technology at the University of Klagenfurt is also part of the “Hall of Fame” and learn: “We only won at the inaugural Video Browser Showdown in 2012, but we’ve repeatedly made it onto the podium. As a relatively small team, we’re very proud of that.”

There is also pride in the evolution of the competition and the fact that more and more international researchers are taking an interest in it. Swiss colleagues have now set up a new server that facilitates and optimises the hybrid implementation. This allows the research teams to compete with each other between the annual conferences. The Video Browser Showdown is also streamed live on Twitch. The winning team writes a comprehensive summary and evaluation as a scientific paper – achieving even more visibility. Klaus Schöffmann is still the main organiser of the Video Browser Showdown, together with Werner Bailer (Joanneum Research Austria), Jakub Lokoc (Charles University in Prague), Cathal Gurrin (Dublin City University) and Luca Rossetto (University of Zurich). The success story that began in Klagenfurt in 2012 continues to inspire researchers around the world to deliver outstanding work. What is won here is not only the contests, but – more importantly – new insights for research.

for ad astra: Romy Müller

About Klaus Schöffmann

Klaus Schöffmann is Associate Professor at the Department of Information Technology at the University of Klagenfurt. His research focuses on understanding video content (especially medical, surgical videos), multimedia retrieval, interactive multimedia and applied Deep Learning. He is the (co-)author of more than 100 publications in this field. He is a member of the IEEE and the ACM and a regular reviewer for international conferences and journals in the field of multimedia.

How good are we at evaluating videos?

For cars to drive autonomously, for example, and for image material to be evaluated autonomously, the machine must recognise what it perceives with cameras. If you believe science fiction, we have come a long way in the development of such intelligent machines. Klaus Schöffmann puts this progress into perspective: “It will take a lot more research before a machine recognises what it sees, just like a human being. An autonomously travelling car cannot always distinguish between a tree and a human being. Our problem is the abundance of data. Amidst this profusion of information, we have to cope with inaccuracies, and often we have as little as 70 per cent precision. In sensitive areas of deployment such as autonomous vehicles, this is definitely too imprecise.”

Does searching in large video portals like YouTube work (‘artificially’) intelligently at least? Here, too, the realisation is sobering: “Results are found purely through the texts that are posted alongside the video.” Some progress has been made, for example, in the area of similarity search. But if you don’t have any visual material with which to search for something similar, you first have to describe it in words. And here the same question arises: Is it possible to describe the concept of a tree, a house, an animal or a certain event (e.g. a wedding) with sufficient accuracy to allow us to quickly find what we are looking for in a very large video archive?
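The limitation Schöffmann describes can be made concrete with a minimal sketch. This is a toy illustration, not YouTube’s actual system: a purely metadata-based search only matches words that happen to appear in the text posted alongside a video, so a query fails as soon as the wording differs, regardless of what the footage shows. The video IDs and descriptions are invented for the example.

```python
# Toy sketch of metadata-only video search: results come purely from
# the text posted alongside the video, not from the visual content.
from collections import defaultdict

# Invented example metadata.
videos = {
    "v1": "wedding ceremony in a garden",
    "v2": "how to plant a tree in your garden",
    "v3": "animal documentary about foxes",
}

# Build an inverted index: word -> set of video ids whose text mentions it.
index = defaultdict(set)
for vid, text in videos.items():
    for word in text.lower().split():
        index[word].add(vid)

def search(query):
    # Return videos whose metadata contains ALL query words (boolean AND).
    results = set(videos)
    for word in query.lower().split():
        results &= index.get(word, set())
    return sorted(results)

print(search("garden"))       # -> ['v1', 'v2']
print(search("tree garden"))  # -> ['v2']
print(search("fox"))          # -> [] ("foxes" does not match "fox")
```

The last query shows the weak spot: the fox documentary exists, but the wording of the query and the metadata diverge, so a text-only search comes up empty. This is exactly the gap that content-based and similarity search aim to close.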