Last updated: 2025-04-16 04:20:04. Maintained by Weisen Jiang.

citation publish date title (pdf) review authors
2249 2023-10-05 Improved Baselines with Visual Instruction Tuning link
674 2023-11-27 MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI link
674 2023-04-17 DETRs Beat YOLOs on Real-time Object Detection link
606 2024-01-19 Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data link
512 2023-10-12 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering link
371 2023-10-23 Wonder3D: Single Image to 3D using Cross-Domain Diffusion link
360 2023-08-01 LISA: Reasoning Segmentation via Large Language Model link
358 2023-11-07 mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration link
356 2023-11-28 MVBench: A Comprehensive Multi-modal Video Understanding Benchmark link
339 2023-09-22 Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction link
314 2023-12-12 VILA: On Pre-training for Visual Language Models link
311 2023-11-29 VBench: Comprehensive Benchmark Suite for Video Generative Models link
300 2023-11-21 SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering link
299 2023-11-28 Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation link
297 2023-11-27 Mip-Splatting: Alias-free 3D Gaussian Splatting link
283 2023-12-14 CogAgent: A Visual Language Model for GUI Agents link
278 2023-12-21 DUSt3R: Geometric 3D Vision Made Easy link
260 2023-04-06 InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning link
255 2024-01-11 Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs link
253 2024-01-17 VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models link
234 2023-07-18 AnyDoor: Zero-shot Object-level Image Customization link
233 2023-07-31 MovieChat: From Dense Token to Sparse Memory for Long Video Understanding link
233 2023-12-04 SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM link
233 2023-11-30 Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering link
224 2023-11-11 Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models link
220 2023-12-20 Generative Multimodal Models are In-Context Learners link
219 2023-12-11 Gaussian Splatting SLAM link
216 2024-01-30 YOLO-World: Real-Time Open-Vocabulary Object Detection link
212 2023-12-19 pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction link
210 2023-09-28 Text-to-3D using Gaussian Splatting link
199 2023-11-30 One-step Diffusion with Distribution Matching Distillation link
194 2023-11-21 Diffusion Model Alignment Using Direct Preference Optimization link
192 2023-03-08 Video-P2P: Video Editing with Cross-attention Control link
189 2023-11-14 Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding link
183 2023-06-26 DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing link
182 2023-11-28 Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding link
182 2023-12-15 Point Transformer V3: Simpler Faster Stronger link
177 2023-12-07 PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding link
177 2023-11-06 GLaMM: Pixel Grounding Large Multimodal Model link
176 2023-11-20 GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting link
171 2023-11-14 One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion link
169 2023-07-13 HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models link
168 2023-11-19 LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching link
166 2023-11-27 MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model link
165 2024-01-22 SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities link
164 2023-12-26 LangSplat: 3D Language Gaussian Splatting link
164 2023-12-01 RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback link
163 2023-12-13 DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes link
163 2023-11-24 GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting link
161 2023-12-20 Splatter Image: Ultra-Fast Single-View 3D Reconstruction link
159 2023-12-14 Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers link
159 2023-11-20 PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics link
159 2024-02-29 Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers link
155 2023-12-04 TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding link
154 2023-11-29 OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation link
154 2023-11-22 Compact 3D Gaussian Representation for Radiance Field link
153 2023-12-05 ReconFusion: 3D Reconstruction with Diffusion Priors link
149 2023-07-18 RepViT: Revisiting Mobile CNN From ViT Perspective link
147 2023-12-06 Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields link
142 2023-12-13 FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects link
142 2023-12-04 SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes link
141 2023-12-01 Sequential Modeling Enables Scalable Learning for Large Vision Models link
139 2023-11-30 Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives link
136 2023-10-23 HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models link
136 2023-12-05 Analyzing and Improving the Training Dynamics of Diffusion Models link
136 2023-12-28 Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action link
129 2023-04-12 An Edit Friendly DDPM Noise Space: Inversion and Manipulations link
128 2023-04-03 DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models link
128 2023-12-28 Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis link
126 2023-12-04 Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation link
122 2023-12-01 EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything link
120 2023-11-10 Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks link
118 2023-09-20 FreeU: Free Lunch in Diffusion U-Net link
118 2023-11-29 GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces link
116 2023-10-17 EvalCrafter: Benchmarking and Evaluating Large Video Generation Models link
115 2023-11-24 GeoChat: Grounded Large Vision-Language Model for Remote Sensing link
114 2024-01-24 Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild link
114 2023-11-30 Rethinking FID: Towards a Better Evaluation Metric for Image Generation link
112 2023-12-01 DeepCache: Accelerating Diffusion Models for Free link
112 2023-11-16 Emu Edit: Precise Image Editing via Recognition and Generation Tasks link
111 2023-10-12 GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models link
110 2023-11-27 SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution link
110 2024-03-27 UniDepth: Universal Monocular Metric Depth Estimation link
108 2023-12-04 Style Aligned Image Generation via Shared Attention link
107 2023-11-28 RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D link
107 2023-05-14 ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding link
107 2023-03-29 InceptionNeXt: When Inception Meets ConvNeXt link
105 2023-11-29 MoMask: Generative Masked Modeling of 3D Human Motions link
105 2023-12-21 Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models link
105 2023-12-21 V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs link
105 2023-12-11 Honeybee: Locality-enhanced Projector for Multimodal LLM link
103 2023-11-29 Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving link
101 2023-12-15 SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery link
99 2023-11-29 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling link
98 2023-11-27 MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers link
98 2023-11-27 GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions link
98 2023-06-08 Grounded Text-to-Image Synthesis with Attention Refocusing link
97 2023-12-12 LMDrive: Closed-Loop End-to-End Driving with Large Language Models link
96 2023-12-12 COLMAP-Free 3D Gaussian Splatting link
95 2023-11-30 VTimeLLM: Empower LLM to Grasp Video Moments link
95 2024-02-27 VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction link
95 2023-12-06 OneLLM: One Framework to Align All Modalities with Language link
95 2023-11-17 Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis link
95 2023-11-14 UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs link
94 2023-12-04 GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians link
93 2023-03-21 CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation link
93 2024-06-16 Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling link
90 2024-03-11 DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization link
90 2023-03-16 HIVE: Harnessing Human Feedback for Instructional Visual Editing link
89 2023-12-14 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting link
89 2023-12-26 DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision link
89 2023-11-26 GS-IR: 3D Gaussian Splatting for Inverse Rendering link
88 2023-12-04 GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians link
88 2023-11-27 UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition link
87 2023-10-19 Putting the Object Back into Video Object Segmentation link
86 2023-12-05 GauHuman: Articulated Gaussian Splatting from Monocular Human Videos link
86 2024-06-16 OpenEQA: Embodied Question Answering in the Era of Foundation Models link
84 2023-05-24 RoMa: Robust Dense Feature Matching link
84 2023-09-07 InstructDiffusion: A Generalist Modeling Interface for Vision Tasks link
84 2023-12-07 DreamVideo: Composing Your Dream Videos with Customized Subject and Motion link
83 2023-12-06 Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle link
82 2023-11-23 SinSR: Diffusion-Based Image Super-Resolution in a Single Step link
82 2023-12-04 GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis link
81 2023-11-18 Make Pixels Dance: High-Dynamic Video Generation link
81 2023-12-01 ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts link
81 2023-12-04 StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On link
80 2024-01-08 GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation link
79 2023-12-08 Reconstructing Hands in 3D with Transformers link
79 2024-04-08 MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding link
79 2023-11-29 HUGS: Human Gaussian Splats link
78 2023-11-28 Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras link
78 2023-11-30 Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding link
76 2023-08-15 CoDeF: Content Deformation Fields for Temporally Consistent Video Processing link
75 2023-08-18 SimDA: Simple Diffusion Adapter for Efficient Video Generation link
74 2024-03-29 Rewrite the Stars link
74 2023-11-22 Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model link
74 2023-11-27 Compositional Chain-of-Thought Prompting for Large Multimodal Models link
73 2023-11-30 LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning link
73 2023-12-11 Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer link
73 2023-11-28 Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering link
72 2023-12-24 ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation link
72 2023-12-04 PixelLM: Pixel Reasoning with Large Multimodal Model link
71 2024-03-08 SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting link
71 2023-12-15 Osprey: Pixel Understanding with Visual Instruction Tuning link
70 2023-11-28 HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting link
68 2024-04-12 Probing the 3D Awareness of Visual Foundation Models link
68 2023-12-06 Alpha-CLIP: A CLIP Model Focusing on Wherever You Want link
68 2023-12-14 Holodeck: Language Guided Generation of 3D Embodied AI Environments link
67 2024-02-05 InstanceDiffusion: Instance-level Control for Image Generation link
67 2024-06-16 Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection link
67 2023-08-23 Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion link
66 2024-03-10 MACE: Mass Concept Erasure in Diffusion Models link
66 2023-12-05 Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? link
66 2023-11-12 Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models link
65 2023-06-16 PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation link
64 2023-11-28 LEDITS++: Limitless Image Editing using Text-to-Image Models link
64 2023-11-27 GART: Gaussian Articulated Template Models link
63 2024-03-10 Poly Kernel Inception Network for Remote Sensing Detection link
62 2023-11-21 SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction link
62 2023-11-28 Human Gaussian Splatting: Real-time Rendering of Animatable Avatars link
61 2023-12-06 Relightable Gaussian Codec Avatars link
61 2024-03-18 Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters link
61 2023-11-22 HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data link
60 2023-12-15 Rich Human Feedback for Text-to-Image Generation link
60 2023-06-20 Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation link
60 2023-12-07 Scaling Laws of Synthetic Images for Model Training ... for Now link
60 2023-10-02 HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation link
60 2023-09-20 RMT: Retentive Networks Meet Vision Transformers link
60 2023-11-28 TransNeXt: Robust Foveal Visual Perception for Vision Transformers link
59 2023-11-27 Optimal Transport Aggregation for Visual Place Recognition link
59 2024-03-19 HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting link
59 2023-09-14 Generative Image Dynamics link
59 2023-12-01 VideoBooth: Diffusion-based Video Generation with Image Prompts link
59 2023-12-21 Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models link
58 2023-11-30 Diffusion Models Without Attention link
58 2023-04-03 RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding link
58 2023-06-30 DisCo: Disentangled Control for Realistic Human Dance Generation link
57 2024-03-03 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos link
57 2023-11-26 NeuRAD: Neural Rendering for Autonomous Driving link
55 2023-12-06 Cache Me if You Can: Accelerating Diffusion Models through Block Caching link
55 2023-05-25 Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models link
55 2023-12-15 PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment link
55 2023-11-27 CG-HOI: Contact-Guided 3D Human-Object Interaction Generation link
55 2024-02-08 MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis link
55 2023-11-27 SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation link
55 2023-09-06 Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields link
54 2024-03-20 Multi-Modal Hallucination Control by Visual Information Grounding link
53 2023-11-16 DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback link
53 2024-02-22 Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis link
52 2023-12-12 FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition link
52 2023-12-01 VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models link
52 2024-03-14 Generalized Predictive Model for Autonomous Driving link
52 2023-12-26 SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation link
52 2023-06-27 Symphonize 3D Semantic Scene Completion with Contextual Instance Queries link
51 2023-12-11 PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization link
51 2023-12-08 SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation link
51 2023-11-30 BioCLIP: A Vision Foundation Model for the Tree of Life link
50 2024-01-10 VLP: Vision Language Planning for Autonomous Driving link
50 2023-12-11 SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models link
50 2023-12-12 WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion link
50 2023-07-14 NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis link
50 2023-11-29 MMA-Diffusion: MultiModal Attack on Diffusion Models link
50 2024-02-08 Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents link
49 2023-12-10 ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering link
49 2023-11-28 Panacea: Panoramic and Controllable Video Generation for Autonomous Driving link
48 2024-03-13 Scaling Up Dynamic Human-Scene Interaction Modeling link
48 2023-12-26 One-dimensional Adapter to Rule Them All: Concepts Diffusion Models and Erasing Applications link
47 2024-02-27 Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners link
47 2023-12-06 WonderJourney: Going from Anywhere to Everywhere link
47 2023-11-27 Self-correcting LLM-controlled Diffusion Models link
47 2023-12-26 EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI link
47 2023-12-07 Free3D: Consistent Novel View Synthesis without 3D Representation link
47 2023-10-31 CapsFusion: Rethinking Image-Text Data at Scale link
47 2023-12-15 GSVA: Generalized Segmentation via Multimodal Large Language Models link
47 2023-10-17 4K4D: Real-Time 4D View Synthesis at 4K Resolution link
46 2023-11-28 A Unified Approach for Text- and Image-guided 4D Scene Generation link
46 2024-01-18 OMG-Seg: Is One Model Good Enough For All Segmentation? link
46 2023-12-12 Hallucination Augmented Contrastive Learning for Multimodal Large Language Model link
46 2023-12-12 VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens link
46 2024-03-03 Logit Standardization in Knowledge Distillation link
46 2024-04-05 SpatialTracker: Tracking Any 2D Pixels in 3D Space link
46 2023-11-28 Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer link
46 2024-02-27 Neural Video Compression with Feature Modulation link
45 2024-03-18 Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning link
45 2023-11-19 Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection link
45 2023-12-18 Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering link
44 2024-03-06 Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing link
44 2023-12-07 VGGSfM: Visual Geometry Grounded Deep Structure From Motion link
44 2024-06-16 SEED-Bench: Benchmarking Multimodal Large Language Models link
43 2024-01-17 GARField: Group Anything with Radiance Fields link
43 2024-01-31 Binding Touch to Everything: Learning Unified Multimodal Tactile Representations link
43 2023-08-18 Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training link
43 2023-09-04 Can I Trust Your Answer? Visually Grounded Video Question Answering link
43 2023-12-17 Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance link
43 2023-11-27 CoSeR: Bridging Image and Language for Cognitive Super-Resolution link
43 2023-11-20 BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning link
43 2023-06-01 Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models link
42 2023-11-18 SNI-SLAM: Semantic Neural Implicit SLAM link
42 2023-11-23 GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence link
41 2023-11-29 SODA: Bottleneck Diffusion Models for Representation Learning link
41 2024-02-20 Video ReCap: Recursive Captioning of Hour-Long Videos link
41 2023-10-12 UniPAD: A Universal Pre-training Paradigm for Autonomous Driving link
41 2023-11-24 DemoFusion: Democratising High-Resolution Image Generation With No $$$ link
41 2024-02-04 DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing link
40 2023-12-18 GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning link
40 2024-04-09 3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis link
40 2024-04-30 XFeat: Accelerated Features for Lightweight Image Matching link
40 2024-02-14 OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM link
40 2023-11-30 GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs link
40 2023-12-07 LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs link
40 2024-03-11 FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization link
40 2024-01-03 Instruct-Imagen: Image Generation with Multi-modal Instruction link
39 2023-12-06 HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting link
39 2024-06-16 VideoLLM-online: Online Video Large Language Model for Streaming Video link
39 2023-11-30 CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation link
39 2023-12-07 RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models link
39 2024-03-09 RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection link
39 2023-12-06 MMM: Generative Masked Motion Model link
39 2023-04-13 Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction link
38 2024-02-06 EscherNet: A Generative Model for Scalable View Synthesis link
38 2023-12-02 Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction link
38 2023-12-04 Towards Learning a Generalist Model for Embodied Navigation link
38 2024-03-06 Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation link
38 2023-11-27 SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation link
38 2024-03-12 ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions link
38 2024-02-15 GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering link
37 2023-10-27 ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image link
37 2023-11-30 Fast ODE-based Sampling for Diffusion Models in Around 5 Steps link
37 2024-04-10 Scaling Laws for Data Filtering-- Data Curation cannot be Compute Agnostic link
37 2023-09-28 CCEdit: Creative and Controllable Video Editing via Diffusion Models link
37 2024-03-19 Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection link
37 2023-11-24 OneFormer3D: One Transformer for Unified Point Cloud Segmentation link
37 2023-11-29 Gaussian Shell Maps for Efficient 3D Human Generation link
37 2023-12-07 Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation link
37 2024-01-11 Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications link
37 2024-03-07 Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed link
36 2024-01-02 Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models link
36 2023-12-27 Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection link
36 2024-03-04 ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models link
36 2023-09-26 Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline link
36 2024-02-29 DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models link
36 2023-12-06 XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies link
36 2024-03-30 InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning link
36 2023-12-12 EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection link
36 2023-12-29 Visual Point Cloud Forecasting enables Scalable Autonomous Driving link
36 2023-12-03 ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models link
36 2023-05-19 Equivariant Multi-Modality Image Fusion link
36 2023-12-19 InstructVideo: Instructing Video Diffusion Models with Human Feedback link
36 2023-11-28 Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following link
36 2023-12-14 A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions link
36 2024-04-15 PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI link
35 2023-09-01 CityDreamer: Compositional Generative Model of Unbounded 3D Cities link
35 2023-11-20 LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge link
35 2023-12-14 General Object Foundation Model for Images and Videos at Scale link
35 2023-12-14 Mosaic-SDF for 3D Generative Models link
35 2023-12-10 AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One link
35 2023-12-17 VidToMe: Video Token Merging for Zero-Shot Video Editing link
35 2023-12-06 On the Robustness of Large Multimodal Models Against Image Adversarial Attacks link
35 2023-11-28 MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training link
35 2024-02-15 DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization link
35 2023-12-07 GenTron: Diffusion Transformers for Image and Video Generation link
35 2023-11-28 LLaFS: When Large Language Models Meet Few-Shot Segmentation link
34 2023-12-05 GPT4Point: A Unified Framework for Point-Language Understanding and Generation link
34 2023-12-07 Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation link
34 2024-02-26 GROUNDHOG: Grounding Large Language Models to Holistic Segmentation link
34 2023-12-19 Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation link
34 2023-12-01 Grounding Everything: Emerging Localization Properties in Vision-Language Transformers link
34 2024-03-04 RegionGPT: Towards Region Understanding Vision Language Model link
34 2023-12-06 On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm link
34 2023-11-28 Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence link
34 2024-02-29 Towards Generalizable Tumor Synthesis link
34 2023-12-04 VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence link
34 2023-06-27 Detector-Free Structure from Motion link
34 2023-10-01 Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs link
33 2024-03-01 Rethinking Inductive Biases for Surface Normal Estimation link
33 2024-06-16 SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes link
33 2023-11-26 BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP link
33 2024-03-01 Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching link
33 2023-12-18 SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution link
33 2024-01-17 Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior link
33 2024-04-08 LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding link
33 2023-12-12 MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception link
33 2023-11-30 ChatPose: Chatting about 3D Human Pose link
33 2023-11-27 Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion link
33 2023-11-28 Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration link
32 2023-12-17 SAI3D: Segment Any Instance in 3D Scenes link
32 2024-04-15 SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction link
32 2023-11-28 Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now link
32 2023-12-19 Optimizing Diffusion Noise Can Serve As Universal Motion Priors link
32 2024-04-07 Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models link
32 2023-11-29 SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis link
32 2024-03-18 MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception link
32 2024-03-24 EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World link
31 2023-12-01 Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion link
31 2024-04-05 Koala: Key Frame-Conditioned Long Video-LLM link
31 2024-02-23 State Space Models for Event Cameras link
31 2023-12-04 COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction link
31 2023-08-20 Boosting Adversarial Transferability by Block Shuffle and Rotation link
31 2023-03-24 DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis link
31 2024-04-01 Video Interpolation with Diffusion Models link
31 2023-09-06 Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation link
31 2024-03-11 Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts link
31 2024-03-26 Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance link
31 2023-12-12 CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor link
31 2023-12-11 Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution link
30 2024-02-07 SPAD: Spatially Aware Multi-View Diffusers link
30 2023-12-06 AVID: Any-Length Video Inpainting with Diffusion Model link
30 2023-09-29 Text-Image Alignment for Diffusion-Based Perception link
30 2024-03-25 Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion link
30 2024-01-03 From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations link
30 2023-12-11 EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion link
30 2024-03-14 OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning link
30 2023-11-28 Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld link
30 2024-01-23 The Neglected Tails in Vision-Language Models link
30 2023-12-14 Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft link
30 2023-12-10 SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction link
30 2024-06-16 PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving link
29 2023-12-05 Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation link
29 2024-01-17 Vlogger: Make Your Dream A Vlog link
29 2024-03-25 RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection link
29 2024-02-08 Driving Everywhere with Large Language Model Policy Adaptation link
29 2023-08-19 Noisy-Correspondence Learning for Text-to-Image Person Re-identification link
29 2023-11-27 TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models link
29 2023-12-12 PEEKABOO: Interactive Video Generation via Masked-Diffusion link
29 2023-09-27 BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning link
29 2023-11-28 AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond link
29 2024-03-02 TUMTraf V2X Cooperative Perception Dataset link
29 2023-12-13 Towards Text-guided 3D Scene Composition link
29 2023-11-27 SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion link
29 2024-05-29 NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild link
29 2024-03-22 IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection link
28 2024-01-17 TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion link
28 2023-12-04 Aligning and Prompting Everything All at Once for Universal Visual Perception link
28 2023-12-03 FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding link
28 2023-12-02 Diffusion Handles Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D link
28 2024-02-27 Preserving Fairness Generalization in Deepfake Detection link
28 2024-01-01 Retrieval-Augmented Egocentric Video Captioning link
28 2023-12-28 ZONE: Zero-Shot Instruction-Guided Local Editing link
28 2024-03-15 IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation link
28 2023-12-05 Describing Differences in Image Sets with Natural Language link
28 2024-03-05 SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection link
28 2024-03-05 PromptKD: Unsupervised Prompt Distillation for Vision-Language Models link
28 2023-12-29 FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis link
28 2023-12-07 NeRFiller: Completing Scenes via Generative 3D Inpainting link
28 2023-11-11 PerceptionGPT: Effectively Fusing Visual Perception into LLM link
27 2023-12-21 PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models link
27 2024-03-08 Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery link
27 2023-11-30 MotionEditor: Editing Video Motion via Content-Aware Diffusion link
27 2023-11-20 OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning link
27 2023-12-19 Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model link
27 2023-11-18 Structure-Aware Sparse-View X-ray 3D Reconstruction link
27 2023-12-13 FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models link
27 2023-12-18 SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing link
27 2024-01-16 MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World link
27 2024-01-09 DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation link
27 2024-01-25 pix2gestalt: Amodal Segmentation by Synthesizing Wholes link
27 2024-04-08 PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection link
27 2023-06-07 WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models link
27 2023-11-22 Visual In-Context Prompting link
27 2023-11-27 Single-Model and Any-Modality for Video Object Tracking link
27 2023-11-28 Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation link
27 2023-11-28 SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors link
27 2023-12-08 ControlRoom3D: Room Generation using Semantic Proxy Rooms link
27 2024-01-31 AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error link
27 2023-11-03 HIPTrack: Visual Tracking with Historical Prompts link
27 2024-03-15 Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers link
27 2024-02-29 CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition link
27 2024-02-14 Loopy-SLAM: Dense Neural SLAM with Loop Closures link
27 2024-04-11 OpenBias: Open-set Bias Detection in Text-to-Image Generative Models link
26 2024-04-01 Streaming Dense Video Captioning link
26 2024-01-29 SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design link
26 2023-12-07 Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos link
25 2023-12-07 Open-Vocabulary Segmentation with Semantic-Assisted Calibration link
25 2023-06-20 CrossKD: Cross-Head Knowledge Distillation for Object Detection link
25 2023-11-29 MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning link
25 2023-11-29 Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models link
25 2023-12-07 MuRF: Multi-Baseline Radiance Fields link
25 2023-12-07 Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models link
25 2024-04-04 WorDepth: Variational Language Prior for Monocular Depth Estimation link
25 2024-03-15 Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives link
25 2024-02-27 VRP-SAM: SAM with Visual Reference Prompt link
25 2024-03-17 Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model link
25 2023-09-12 Language Models as Black-Box Optimizers for Vision-Language Models link
25 2023-12-05 Orthogonal Adaptation for Modular Customization of Diffusion Models link
25 2023-12-11 CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models link
25 2023-12-27 SVGDreamer: Text Guided SVG Generation with Diffusion Model link
25 2023-11-25 VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning link
25 2023-12-31 EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling link
25 2024-05-11 EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation link
24 2024-04-11 GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh link
24 2024-03-03 Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis link
24 2023-05-27 Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers link
24 2024-02-19 Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships link
24 2023-05-17 One-Prompt to Segment All Medical Images link
24 2023-08-25 Residual Denoising Diffusion Models link
24 2023-11-21 Breathing Life Into Sketches Using Text-to-Video Priors link
24 2023-12-26 Inter-X: Towards Versatile Human-Human Interaction Analysis link
24 2024-02-23 Seamless Human Motion Composition with Blended Positional Encodings link
24 2024-03-29 FairCLIP: Harnessing Fairness in Vision-Language Learning link
24 2023-12-12 DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing link
24 2024-05-07 DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving link
24 2024-02-28 Polos: Multimodal Metric Learning from Human Feedback for Image Captioning link
24 2023-12-20 A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models link
24 2023-11-26 Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding link
24 2024-05-19 Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology link
24 2023-01-26 Discovering and Mitigating Visual Biases through Keyword Explanation link
24 2023-12-15 Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation link
24 2023-11-30 HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video link
24 2024-03-11 DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations link
23 2024-05-22 ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles link
23 2023-11-23 Posterior Distillation Sampling link
23 2023-11-30 DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars link
23 2024-04-05 3D Facial Expressions through Analysis-by-Neural-Synthesis link
23 2024-03-27 Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding link
23 2024-05-21 OmniGlue: Generalizable Feature Matching with Foundation Model Guidance link
23 2023-08-26 Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs link
23 2023-12-25 A Recipe for Scaling up Text-to-Video Generation with Text-free Videos link
23 2023-10-10 MuseChat: A Conversational Music Recommendation System for Videos link
23 2024-01-15 MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation link
23 2023-12-28 Improving Image Restoration through Removing Degradations in Textual Representations link
23 2023-12-05 FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions link
23 2024-03-27 ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation link
23 2023-12-05 Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models link
23 2023-05-25 Learning Occupancy for Monocular 3D Object Detection link
23 2023-06-15 Generative Proxemics: A Prior for 3D Social Interaction from Images link
23 2023-12-31 Taming Mode Collapse in Score Distillation for Text-to-3D Generation link
23 2023-12-26 HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D link
23 2024-03-04 HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances link
23 2024-01-16 Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary link
23 2024-04-10 HRVDA: High-Resolution Visual Document Assistant link
22 2024-06-16 Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration link
22 2024-04-11 MindBridge: A Cross-Subject Brain Decoding Framework link
22 2023-03-25 UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes link
22 2024-03-12 PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution link
22 2023-03-23 NOPE: Novel Object Pose Estimation from a Single Image link
22 2024-04-29 An Aggregation-Free Federated Learning for Tackling Data Heterogeneity link
22 2024-04-01 Towards Memorization-Free Diffusion Models link
22 2023-12-12 A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames link
22 2023-12-11 Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior link
22 2023-11-22 PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF link
22 2023-05-19 DAP: A Dynamic Adversarial Patch for Evading Person Detectors link
22 2024-03-21 CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing link
22 2024-04-25 TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation link
22 2023-12-28 ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe link
22 2023-12-07 Digital Life Project: Autonomous 3D Characters with Social Intelligence link
22 2023-12-01 Dense Optical Tracking: Connecting the Dots link
22 2024-01-24 LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection link
22 2024-03-15 RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception link
22 2024-03-21 SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks link
22 2024-04-04 Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation link
22 2023-08-22 MatFuse: Controllable Material Generation with Diffusion Models link
22 2023-09-14 DePT: Decoupled Prompt Tuning link
22 2023-11-27 EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension link
22 2024-03-28 OmniParser: A Unified Framework for Text Spotting Key Information Extraction and Table Recognition link
22 2023-05-31 Control4D: Efficient 4D Portrait Editing with Text link
22 2023-12-11 DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior link
22 2024-04-06 InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization link
21 2023-11-27 Text2Loc: 3D Point Cloud Localization from Natural Language link
21 2024-04-05 DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models link
21 2023-11-29 Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications link
21 2024-04-01 LLMs are Good Sign Language Translators link
21 2024-04-02 ViTamin: Designing Scalable Vision Models in the Vision-Language Era link
21 2023-06-07 CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation link
21 2024-03-19 Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images link
21 2024-05-19 Transcriptomics-guided Slide Representation Learning in Computational Pathology link
21 2024-06-06 Matching Anything by Segmenting Anything link
21 2023-12-11 Grounded Question-Answering in Long Egocentric Videos link
21 2023-11-30 ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation link
21 2024-04-04 MonoCD: Monocular 3D Object Detection with Complementary Depths link
21 2023-11-30 TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model link
21 2024-03-28 Test-Time Domain Generalization for Face Anti-Spoofing link
21 2024-01-03 A Vision Check-up for Language Models link
21 2024-03-19 FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation link
21 2024-03-21 Volumetric Environment Representation for Vision-Language Navigation link
21 2024-01-16 TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding link
21 2024-04-29 EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars link
21 2023-11-30 Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing link
21 2023-12-14 DiffusionLight: Light Probes for Free by Painting a Chrome Ball link
20 2023-12-04 ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation link
20 2024-03-26 Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models link
20 2023-11-20 DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation link
20 2023-12-18 CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update link
20 2024-02-01 CapHuman: Capture Your Moments in Parallel Universes link
20 2023-12-21 VCoder: Versatile Vision Encoders for Multimodal Large Language Models link
20 2022-12-06 Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis link
20 2024-02-08 Question Aware Vision Transformer for Multimodal Reasoning link
20 2024-05-03 On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning? link
20 2024-01-04 Learning the 3D Fauna of the Web link
20 2023-11-13 Open-Vocabulary Video Anomaly Detection link
20 2024-04-16 Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology link
20 2024-02-27 Accelerating Diffusion Sampling with Optimized Time Steps link
20 2024-03-31 Towards Realistic Scene Generation with LiDAR Diffusion Models link
20 2024-02-27 VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis link
20 2023-12-21 ZeroShape: Regression-based Zero-shot Shape Reconstruction link
20 2024-04-09 GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation link
20 2024-03-22 Tri-Perspective View Decomposition for Geometry-Aware Depth Completion link
20 2023-12-09 CoGS: Controllable Gaussian Splatting link
20 2023-11-29 GenZI: Zero-Shot 3D Human-Scene Interaction Generation link
20 2023-11-17 Multimodal Representation Learning by Alternating Unimodal Adaptation link
19 2023-11-30 On Exact Inversion of DPM-Solvers link
19 2024-03-19 Task-Customized Mixture of Adapters for General Image Fusion link
19 2024-05-21 Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks link
19 2024-06-16 Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space link
19 2024-02-27 CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention link
19 2023-11-29 Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching link
19 2024-03-11 Exploiting Style Latent Flows for Generalizing Deepfake Video Detection link
19 2023-12-04 Readout Guidance: Learning Control from Diffusion Features link
19 2024-04-04 MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation link
19 2023-12-05 C3: High-Performance and Low-Complexity Neural Compression from a Single Image or Video link
19 2023-12-20 Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis link
19 2023-11-09 Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities link
19 2023-11-18 SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation link
19 2023-12-12 GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos link
19 2024-03-24 Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement link
19 2024-02-12 Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles link
19 2024-03-12 Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis link
19 2024-01-02 Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation link
19 2023-12-04 How to Configure Good In-Context Sequence for Visual Question Answering link
19 2024-04-01 Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation link
19 2023-07-24 CLIP-KD: An Empirical Study of CLIP Model Distillation link
19 2024-03-15 LightIt: Illumination Modeling and Control for Diffusion Models link
19 2024-03-19 AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents link
19 2024-02-22 CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation link
19 2024-03-25 TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models link
19 2023-07-14 SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments link
19 2024-03-17 Bilateral Propagation Network for Depth Completion link
19 2024-06-16 MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding link
19 2024-02-29 PEM: Prototype-based Efficient MaskFormer for Image Segmentation link
19 2024-06-16 MMA: Multi-Modal Adapter for Vision-Language Models link
19 2022-11-15 Data Poisoning based Backdoor Attacks to Contrastive Learning link
19 2024-04-06 Diffusion Time-step Curriculum for One Image to 3D Generation link
18 2023-11-25 Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network link
18 2024-03-04 Neural Redshift: Random Networks are not Random Functions link
18 2023-12-05 HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces link
18 2024-06-16 MultiDiff: Consistent Novel View Synthesis from a Single Image link
18 2024-06-05 AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection link
18 2024-03-31 Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction link
18 2023-11-28 Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features link
18 2023-11-30 CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model link
18 2023-12-06 Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation link
18 2024-04-09 MoReVQA: Exploring Modular Reasoning Models for Video Question Answering link
18 2023-11-26 Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding link
18 2024-04-10 MoCha-Stereo: Motif Channel Attention Network for Stereo Matching link
18 2024-04-14 DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection link
18 2024-03-26 Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval link
18 2023-12-04 PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation link
18 2024-05-07 Tactile-Augmented Radiance Fields link
18 2024-02-28 Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis link
18 2024-04-01 MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction link
18 2023-12-06 TokenCompose: Text-to-Image Diffusion with Token-level Supervision link