Researchers at Seoul National University, NVIDIA and Microsoft Release “ACAV100M”: Automatically Organized Video Dataset for Self-Supervised Audiovisual Learning


Audiovisual (AV) learning aims to learn representations from data that combine both audio and visual information. The natural correspondence between visual observations and the sounds that accompany them provides a strong self-supervisory signal for learning video representations. This is why the massive amount of video online has become a valuable source of self-supervised learning for the research community.

However, online videos frequently provide imperfectly aligned audiovisual signals, for example when audio is overdubbed in post-production. Models trained on such uncurated videos have been shown to learn weaker representations because of this misalignment. Existing techniques therefore generally rely on manually curated datasets built around a predetermined taxonomy of semantic concepts, where audiovisual correspondence is very likely.

To fill this gap, researchers at Seoul National University, NVIDIA and Microsoft have released an automatic dataset curation pipeline and a large video dataset for self-supervised audiovisual learning, called ACAV100M (automatically curated audiovisual data). The dataset is built from a large pool of uncurated web videos: the researchers took 140 million full-length videos and narrowed them down to the 100 million segments with the best audiovisual match.


The researchers formulated the data collection procedure as a constrained optimization problem: find the subset of videos that maximizes the total mutual information between the audio and visual channels. Mutual Information (MI) measures how much knowing one variable reduces uncertainty about the other. As a result, a subset with the highest MI is likely to contain a large number of videos with genuine audiovisual correspondence, which is exactly what self-supervised learning needs.
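As a rough illustration (not the paper's code), mutual information between two discrete variables can be computed from their joint and marginal counts; perfectly correlated channels score high, while independent channels score zero:

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) * p(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts for X
    py = Counter(ys)             # marginal counts for Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # c/n is p(x,y); c*n / (px[x]*py[y]) equals p(x,y) / (p(x)*p(y))
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi

# Perfectly aligned "channels" share maximal information (log 2 in nats) ...
aligned = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# ... while independent channels share none.
independent = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
```

In the curation setting, the two variables stand in for signals derived from a video's audio and visual streams, so high MI signals that the two streams actually correspond.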

In principle, one could estimate the audiovisual MI for each video separately and select the videos with the highest scores. However, such instance-level estimates are unreliable on noisy web videos with imperfectly matched audiovisual inputs, and models trained on the resulting selections learn suboptimal representations. Rather than measuring instance-level MI, the team therefore uses a set-based estimate of the MI: the information shared by two clustering assignments of the dataset.

The researchers cluster the videos separately by audio and by visual features, then compute MI from a contingency table that encodes the overlap between audio clusters and visual clusters. The results show that this set-based approach is more robust to real-world noise, allowing it to produce datasets with better audiovisual matching than instance-based MI estimates.
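The set-based idea can be sketched as follows. Each candidate video is reduced to a pair of cluster labels (audio cluster, visual cluster); the subset MI is computed from the contingency table of those pairs, and videos are added greedily when they raise the set-level MI. This is an illustrative toy, not the authors' pipeline, which operates over 140 million clips with far more efficient optimization:

```python
from collections import Counter
from math import log

def clustering_mi(pairs):
    """MI between audio and visual cluster assignments, computed from the
    contingency table of (audio cluster, visual cluster) pair counts."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                       # contingency table cells
    audio = Counter(a for a, _ in pairs)         # row marginals
    visual = Counter(v for _, v in pairs)        # column marginals
    return sum((c / n) * log(c * n / (audio[a] * visual[v]))
               for (a, v), c in joint.items())

def greedy_select(candidates, k):
    """Greedily grow the subset by the candidate that most increases the
    set-level MI. O(n^2 * k): fine for a demo, not for 140M videos."""
    chosen, remaining = [], list(candidates)
    while len(chosen) < k and remaining:
        best = max(remaining, key=lambda p: clustering_mi(chosen + [p]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Videos whose audio and visual cluster labels agree get selected over
# mismatched ones, mimicking curation toward audiovisual correspondence.
candidates = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]
subset = greedy_select(candidates, 4)
```

The key property the article describes is visible even at this scale: the contingency-table MI depends on the whole subset at once, so a single noisy video cannot dominate the estimate the way it can with per-instance scores.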


The researchers compare their dataset to video datasets widely used in self-supervised learning to assess its effectiveness. They pre-train identical models on each dataset with contrastive learning and then run linear evaluation on standard benchmarks. According to the results, models pre-trained on the automatically curated dataset outperform models pre-trained on existing datasets that require human annotation or manual verification. This demonstrates that datasets generated by the proposed method provide the audiovisual correspondence required by self-supervised methods.
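For context, contrastive audiovisual pre-training typically treats each clip's audio embedding and its own visual embedding as a positive pair, with the other clips in the batch as negatives. A minimal InfoNCE-style sketch (a standard formulation, not the authors' training code; the embeddings and temperature here are assumed for illustration):

```python
import numpy as np

def audio_visual_nce(audio_emb, visual_emb, temperature=0.1):
    """InfoNCE loss over a batch: row i of audio_emb should be most similar
    to row i of visual_emb (its positive) among all rows (the negatives)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = (a @ v.T) / temperature                 # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # positives sit on the diagonal

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
aligned_loss = audio_visual_nce(audio, audio)                       # matching pairs
shuffled_loss = audio_visual_nce(audio, rng.normal(size=(8, 16)))   # broken pairs
```

Corresponding audio-visual pairs yield a much lower loss than mismatched ones, which is precisely why curating for audiovisual correspondence matters for this style of pre-training.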



