I devote myself to developing AI techniques that can understand the physical world and interact and communicate with human beings to provide personalized assistance. My topics of interest cover video understanding, question answering, visual grounding, and robotics, with an emphasis on multimodal large language models, robustness, and trustworthiness. I am currently focusing on trustworthy multimodal LLMs and their applications in egocentric embodied assistance. I am actively looking for Research Interns/Assistants/Visiting Students. (NUS master students with CV/NLP/multimodal experience are highly preferred.)
News
Six papers are accepted to SIGIR'25, ICMR'25, MICCAI'25, and ICCV'25.
Invited to be a reviewer for NeurIPS'25 and MM'25.
I will give a talk on NExT-GQA: Visually Grounded VideoQA, invited by Twelve Labs.
Two papers, on video-language models and trustworthy K-VQA, are accepted to ACL'24 and MM'24 respectively.
Our three papers exploring VQA trustworthiness, 3D object affordance, and ego-car accidents are all accepted to CVPR'24.
Invited to be a reviewer for CVPR'24 and ICLR'24.
Two papers are accepted to T-PAMI'23 and ACM MM'23 respectively.
Two papers are accepted to T-PAMI'23 and ICCV'23 respectively.
Invited to serve as a PC Member for AAAI'24.
Invited to be a reviewer for the NeurIPS'23 Datasets and Benchmarks track.
Invited to be a reviewer for ACM MM'23.
Successfully defended my Ph.D.
Thesis: Visual Relation Driven Video Question Answering. Supervisor: Prof. Tat-Seng Chua. Committee: Prof. Mohan Kankanhalli, Prof. Roger Zimmermann. Chair: Prof. Terence Sim.
Featured Publications
Others
Reviewer for Conferences: NeurIPS (Y23, Y24), ICLR (Y24, Y25), CVPR (Y22-Y25), ICCV (Y23, Y25), ECCV (Y22, Y24), AAAI (Y21-Y25), ACL (Y24), ACM MM (Y19-Y24), EMNLP (Y24), ACCV (Y24), ICASSP (Y21-Y22), etc.
Reviewer for Journals: T-PAMI, IJCV, TIP, TMM, TNNLS, ToMM, IPM, etc.