Kumara Kahatapitiya

I am a PhD candidate at Stony Brook University, working with Prof. Michael S. Ryoo. My primary research focus is video understanding. More recently, I have also started working on video-language models, vision transformers, and video diffusion models.

During my PhD, I was an intern at Qualcomm AI Research, Google Brain, and Wormpex AI Research. Prior to this, I was a Research Assistant at the University of Moratuwa, Sri Lanka, advised by Dr. Ranga Rodrigo, where I also received my Bachelor's degree in Electronic & Telecommunication Engineering.

[Google Scholar]    [CV]    [GitHub]    [Twitter]

Recent News
[Mar 2024] Language Repository and MVU for Long Video Understanding are now on arXiv.
[Feb 2024] Video-conditioned Text Representations for activity recognition was accepted at CVPR 2024.
[Jan 2024] Object-Centric Diffusion for Efficient Video Editing is now on arXiv.
[Oct 2023] Grafting Vision Transformers for multi-scale and global information sharing was accepted at WACV 2024.
[Jul 2023] I joined Qualcomm AI Research, Amsterdam as a research intern.
[Apr 2023] SWAT, a structure-aware family of token-based models, was accepted at IJCAI 2023.
[Feb 2023] Token Turing Machines for long-term memory in Transformers was accepted at CVPR 2023.
[Dec 2022] Weakly-guided Self-supervised detection pretraining was accepted at AAAI 2023.
[Sep 2022] StARformer extended to real-world robot environments was accepted at T-PAMI.
[Jul 2022] StARformer with an MDP-like inductive bias for RL was accepted at ECCV 2022.
[Mar 2022] MS-TCT for temporal action detection with CNN+Transformer embeddings was accepted at CVPR 2022.
[Feb 2022] I joined Robotics at Google as a student researcher.
[Dec 2021] I was a finalist (1/30) for the Adobe Research Fellowship 2022. Congratulations to all the winners!
[Dec 2021] Swift for real-time neural video decoding was accepted at NSDI 2022.
[Sep 2021] I am officially a PhD candidate now!
[Mar 2021] Coarse-Fine Networks for efficient temporal activity detection was accepted at CVPR 2021.
[Jan 2021] Exploiting Redundancy in CNNs for parameter reduction was accepted at WACV 2021.
Language Repository for Long Video Understanding
Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo
arXiv 2024
[arxiv] [code]

Understanding Long Videos in One Multimodal Language Model Pass
Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo
arXiv 2024
[project page] [arxiv] [code]

Object-Centric Diffusion for Efficient Video Editing
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian
arXiv 2024
[project page] [arxiv]

Selected Publications
VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo
CVPR 2024

Grafting Vision Transformers
Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo
WACV 2024
[paper] [poster]

SWAT: Spatial Structure Within and Among Tokens
Kumara Kahatapitiya, Michael S. Ryoo
IJCAI 2023
[paper] [code] [slides]

Token Turing Machines
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
CVPR 2023
[paper] [code] [teaser]

Weakly-guided Self-supervised Pretraining for Temporal Activity Detection
Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua
AAAI 2023
[paper] [code] [talk] [poster]

StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning
Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo
ECCV 2022, T-PAMI
[paper] [journal] [code] [talk] [poster]

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond
CVPR 2022
[paper] [code] [poster]

Swift: Adaptive Video Streaming with Layered Neural Codecs
Mallesham Dasari, Kumara Kahatapitiya, Samir Das, Aruna Balasubramanian, Dimitris Samaras
NSDI 2022
[paper] [code] [slides]

Coarse-Fine Networks for Temporal Activity Detection in Videos
Kumara Kahatapitiya, Michael S. Ryoo
CVPR 2021
[paper] [code] [talk] [poster]

Exploiting the Redundancy in Convolutional Filters for Parameter Reduction
Kumara Kahatapitiya, Ranga Rodrigo
WACV 2021
[paper] [code] [talk]

Feature-dependent Cross-Connections in Multi-Path Neural Networks
Dumindu Tissera, Kasun Vithanage, Rukshan Wijesinghe, Kumara Kahatapitiya, Subha Fernando, Ranga Rodrigo
ICPR 2020

Context-Aware Automatic Occlusion Removal
Kumara Kahatapitiya, Dumindu Tissera, Ranga Rodrigo
ICIP 2019
[paper] [code]

Other Projects

  • X3D-Multigrid [code]
    A PyTorch implementation of "X3D: Expanding Architectures for Efficient Video Recognition" [CVPR2020] with "A Multigrid Method for Efficiently Training Video Models" [CVPR2020]. In contrast to the original repository by FAIR, this repository provides a simpler, less modular, and more familiar implementation structure for faster and easier adoption.
  • Optimal Transport in NumPy [code]
    This repository contains a few optimal transport algorithms implemented using NumPy, including "A Direct O(1/epsilon) Iteration Parallel Algorithm for Optimal Transport" [NeurIPS2019], "Computational Optimal Transport: Complexity by Accelerated Gradient Descent is better than by Sinkhorn's Algorithm" [PMLR2018] and "Lightspeed Computation of Optimal Transport" [NeurIPS2013].

Thanks to Jon Barron for the template.