Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
Scott McCrae
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2021-73
May 13, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-73.pdf
As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become increasingly important. In this work, we introduce a novel classification architecture for identifying semantic inconsistencies between the video appearance and the text caption of social media news posts. Our multi-modal fusion framework detects mismatches between videos and captions with an ensemble that combines textual analysis of the caption, automatic speech transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of real-world Facebook news posts. Our multi-modal approach achieves 60.5% classification accuracy on randomly mismatched caption-video pairs, whereas uni-modal models remain below 50% accuracy. Ablation studies further confirm that fusion across modalities is necessary to identify semantic inconsistencies correctly.
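To make the fusion architecture concrete, the sketch below shows a minimal late-fusion binary classifier over pre-extracted per-modality embeddings (caption text, transcript, video semantics, detected objects, named entities, faces). All names, embedding dimensions, and the two-layer fusion head are illustrative assumptions for exposition, not the report's exact implementation.

# Hypothetical sketch of a late-fusion consistency classifier; dimensions
# and layer sizes are assumptions, not the architecture from the report.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Fuses per-modality embeddings and predicts caption/video (in)consistency."""
    def __init__(self, dims, hidden=256):
        super().__init__()
        # One small projection head per modality (caption text, transcript,
        # video semantics, detected objects, named entities, faces).
        self.heads = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden * len(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # 0 = consistent pair, 1 = mismatched pair
        )

    def forward(self, feats):
        # Project each modality, concatenate, and classify the fused vector.
        fused = torch.cat([h(f) for h, f in zip(self.heads, feats)], dim=-1)
        return self.classifier(fused)

# Toy usage: six modality embeddings for a batch of 4 posts.
dims = [768, 768, 512, 256, 128, 128]
model = LateFusionClassifier(dims)
feats = [torch.randn(4, d) for d in dims]
logits = model(feats)          # shape: (4, 2)
print(logits.argmax(dim=-1))   # predicted consistent/mismatched labels

In training, mismatched negatives can be generated by randomly pairing a video with a caption drawn from a different post, mirroring the random caption-appearance mismatches evaluated in the report.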
Advisor: Avideh Zakhor
BibTeX citation:
@mastersthesis{McCrae:EECS-2021-73,
    Author = {McCrae, Scott},
    Title  = {Multi-Modal Semantic Inconsistency Detection in Social Media News Posts},
    School = {EECS Department, University of California, Berkeley},
    Year   = {2021},
    Month  = {May},
    Url    = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-73.html},
    Number = {UCB/EECS-2021-73}
}
EndNote citation:
%0 Thesis
%A McCrae, Scott
%T Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 13
%@ UCB/EECS-2021-73
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-73.html
%F McCrae:EECS-2021-73