Kumara Kahatapitiya

I am a PhD candidate at Stony Brook University, working with Prof. Michael S. Ryoo. My primary research focus is on efficient video representations, for both video understanding and generation. I have worked on fine-grained activity localization, long-video reasoning with large multimodal models, and diffusion-based video editing/generation.

During my PhD, I was an intern at Meta GenAI, Qualcomm AI Research and Google DeepMind. Prior to this, I was a Research Assistant at the University of Moratuwa, Sri Lanka, advised by Dr. Ranga Rodrigo, where I also received my Bachelor's degree in Electronic & Telecommunication Engineering.

[Google Scholar]    [CV]    [GitHub]    [Twitter]
kkahatapitiy [at] cs.stonybrook.edu

Recent News
[Nov 2024] AdaCache for speeding up video DiTs and MarDini for video generation with AR-Diffusion are now on arXiv.
[Oct 2024] Early versions of LangRepo and LVNet will appear at the NeurIPS 2024 workshop on Video-Language Models.
[Oct 2024] An early version of LLaRA will appear at the CoRL 2024 workshop on Language and Robot Learning.
[Jul 2024] Object-Centric Diffusion for Efficient Video Editing was accepted at ECCV 2024.
[Jun 2024] I joined Meta GenAI as a research scientist intern.
[Mar 2024] MVU for Long Video Understanding is now on arXiv.
[Feb 2024] Video-conditioned Text Representations for activity recognition was accepted at CVPR 2024.
[Oct 2023] Grafting Vision Transformers for multi-scale and global information sharing was accepted at WACV 2024.
[Jul 2023] I joined Qualcomm AI Research, Amsterdam as a research intern.
[Apr 2023] SWAT, a structure-aware family of token-based models, was accepted at IJCAI 2023.
[Feb 2023] Token Turing Machines for long-term memory in Transformers was accepted at CVPR 2023.
[Dec 2022] SSDet for weakly-guided self-supervised detection pretraining was accepted at AAAI 2023.
[Jul 2022] StARformer with an MDP-like inductive bias for RL was accepted at ECCV 2022 and T-PAMI.
[Mar 2022] MS-TCT for temporal action detection with CNN+Transformer embeddings was accepted at CVPR 2022.
[Feb 2022] I joined Google DeepMind as a student researcher.
Preprints
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, Tian Xie
arXiv 2024
[project page] [preprint] [code]

MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale
Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, Juan-Manuel Pérez-Rúa
arXiv 2024
[project page] [preprint]

Understanding Long Videos in One Multimodal Language Model Pass
Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo
arXiv 2024
[project page] [preprint] [code] [webinar]

Selected Publications
Language Repository for Long Video Understanding
Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo
NeurIPS 2024 workshops
[paper] [code] [webinar]

Too many frames, not all useful: Efficient Strategies for Long-form Video QA
Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryoo, Donghyun Kim, Michael S. Ryoo
NeurIPS 2024 workshops
[paper] [code] [webinar]

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
CoRL 2024 workshops
[paper] [code]

Object-Centric Diffusion for Efficient Video Editing
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian
ECCV 2024
[project page] [paper] [poster] [talk]

VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo
CVPR 2024
[paper] [poster] [talk]

Grafting Vision Transformers
Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo
WACV 2024
[paper] [poster]

SWAT: Spatial Structure Within and Among Tokens
Kumara Kahatapitiya, Michael S. Ryoo
IJCAI 2023
[paper] [code] [slides]

Token Turing Machines
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
CVPR 2023
[paper] [code] [teaser]

Weakly-guided Self-supervised Pretraining for Temporal Activity Detection
Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua
AAAI 2023
[paper] [code] [talk] [poster]

StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning
Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo
ECCV 2022, TPAMI
[paper] [journal] [code] [talk] [poster]

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond
CVPR 2022
[paper] [code] [poster]

Swift: Adaptive Video Streaming with Layered Neural Codecs
Mallesham Dasari, Kumara Kahatapitiya, Samir Das, Aruna Balasubramanian, Dimitris Samaras
NSDI 2022
[paper] [code] [slides]

Coarse-Fine Networks for Temporal Activity Detection in Videos
Kumara Kahatapitiya, Michael S. Ryoo
CVPR 2021
[paper] [code] [talk] [poster]

Exploiting the Redundancy in Convolutional Filters for Parameter Reduction
Kumara Kahatapitiya, Ranga Rodrigo
WACV 2021
[paper] [code] [talk]

Other Projects

  • X3D-Multigrid [code]
    A PyTorch implementation of "X3D: Expanding Architectures for Efficient Video Recognition" [CVPR2020] with "A Multigrid Method for Efficiently Training Video Models" [CVPR2020]. In contrast to the original repository by FAIR, this repository provides a simpler, less modular and more familiar implementation structure for faster and easier adoption.
  • Optimal Transport in NumPy [code]
    This repository contains a few Optimal Transport algorithms implemented using NumPy, including "A Direct O(1/epsilon) Iteration Parallel Algorithm for Optimal Transport" [NeurIPS2019], "Computational Optimal Transport: Complexity by Accelerated Gradient Descent is better than by Sinkhorn's Algorithm" [PMLR2018] and "Lightspeed Computation of Optimal Transport" [NeurIPS2013].
Teaching


Thanks to Jon Barron for the template.