CVPR 2024

Last updated: 2025-05-19 23:41:50. Maintained by Weisen Jiang.

citation	publish date	title (pdf)	review
2429	2023-10-05	Improved Baselines with Visual Instruction Tuning	link
823	2023-04-17	DETRs Beat YOLOs on Real-time Object Detection	link
735	2023-11-27	MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI	link
709	2024-01-19	Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data	link
554	2023-10-12	4D Gaussian Splatting for Real-Time Dynamic Scene Rendering	link
405	2023-10-23	Wonder3D: Single Image to 3D using Cross-Domain Diffusion	link
399	2023-11-28	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	link
395	2023-08-01	LISA: Reasoning Segmentation via Large Language Model	link
375	2023-11-07	mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	link
361	2023-09-22	Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction	link
353	2023-12-12	VILA: On Pre-training for Visual Language Models	link
348	2023-11-29	VBench: Comprehensive Benchmark Suite for Video Generative Models	link
340	2023-11-28	Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation	link
338	2023-11-21	SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering	link
323	2023-12-21	DUSt3R: Geometric 3D Vision Made Easy	link
321	2023-11-27	Mip-Splatting: Alias-free 3D Gaussian Splatting	link
321	2023-12-14	CogAgent: A Visual Language Model for GUI Agents	link
283	2024-01-11	Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs	link
279	2023-04-06	InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning	link
275	2024-01-17	VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models	link
267	2023-11-30	Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering	link
262	2023-07-31	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	link
256	2023-07-18	AnyDoor: Zero-shot Object-level Image Customization	link
250	2023-12-04	SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM	link
246	2023-12-20	Generative Multimodal Models are In-Context Learners	link
246	2024-01-30	YOLO-World: Real-Time Open-Vocabulary Object Detection	link
242	2023-11-11	Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models	link
239	2023-12-11	Gaussian Splatting SLAM	link
228	2023-12-19	pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction	link
227	2023-11-21	Diffusion Model Alignment Using Direct Preference Optimization	link
224	2023-09-28	Text-to-3D using Gaussian Splatting	link
223	2023-11-14	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	link
219	2023-11-30	One-step Diffusion with Distribution Matching Distillation	link
217	2023-12-15	Point Transformer V3: Simpler Faster Stronger	link
206	2024-01-22	SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities	link
202	2023-03-08	Video-P2P: Video Editing with Cross-attention Control	link
201	2023-11-06	GLaMM: Pixel Grounding Large Multimodal Model	link
200	2023-06-26	DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing	link
198	2023-11-28	Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding	link
189	2023-11-20	GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting	link
188	2023-11-14	One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion	link
187	2023-12-07	PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding	link
185	2023-11-24	GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting	link
184	2023-11-19	LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching	link
182	2023-12-13	FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects	link
182	2023-11-27	MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model	link
179	2023-12-26	LangSplat: 3D Language Gaussian Splatting	link
178	2023-11-20	PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics	link
178	2024-02-29	Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers	link
177	2023-12-01	RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback	link
177	2023-12-13	DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes	link
177	2023-07-18	RepViT: Revisiting Mobile CNN From ViT Perspective	link
176	2023-12-20	Splatter Image: Ultra-Fast Single-View 3D Reconstruction	link
174	2023-12-04	TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	link
173	2023-12-14	Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers	link
172	2023-12-05	ReconFusion: 3D Reconstruction with Diffusion Priors	link
172	2023-11-22	Compact 3D Gaussian Representation for Radiance Field	link
172	2023-07-13	HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models	link
167	2023-11-29	OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation	link
161	2023-12-06	Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields	link
155	2023-10-23	HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models	link
155	2023-12-05	Analyzing and Improving the Training Dynamics of Diffusion Models	link
153	2023-11-30	Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives	link
153	2023-12-04	SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes	link
153	2023-12-01	Sequential Modeling Enables Scalable Learning for Large Vision Models	link
145	2023-12-28	Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis	link
144	2023-12-28	Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action	link
143	2023-12-04	Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation	link
143	2023-11-10	Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks	link
141	2023-04-12	An Edit Friendly DDPM Noise Space: Inversion and Manipulations	link
139	2023-12-01	EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything	link
135	2023-04-03	DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models	link
133	2023-11-24	GeoChat: Grounded Large Vision-Language Model for Remote Sensing	link
131	2023-09-20	FreeU: Free Lunch in Diffusion U-Net	link
129	2023-11-16	Emu Edit: Precise Image Editing via Recognition and Generation Tasks	link
128	2024-03-27	UniDepth: Universal Monocular Metric Depth Estimation	link
127	2024-01-24	Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild	link
127	2023-11-30	Rethinking FID: Towards a Better Evaluation Metric for Image Generation	link
127	2023-10-12	GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models	link
127	2023-10-17	EvalCrafter: Benchmarking and Evaluating Large Video Generation Models	link
125	2023-11-29	GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces	link
125	2023-11-27	SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution	link
122	2023-12-01	DeepCache: Accelerating Diffusion Models for Free	link
122	2023-12-04	Style Aligned Image Generation via Shared Attention	link
122	2023-12-21	V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs	link
120	2023-11-29	MoMask: Generative Masked Modeling of 3D Human Motions	link
118	2023-03-29	InceptionNeXt: When Inception Meets ConvNeXt	link
117	2023-12-15	SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery	link
116	2023-11-29	Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving	link
115	2023-05-14	ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding	link
113	2023-11-28	RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D	link
113	2023-11-30	VTimeLLM: Empower LLM to Grasp Video Moments	link
113	2023-12-21	Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models	link
112	2023-12-12	LMDrive: Closed-Loop End-to-End Driving with Large Language Models	link
111	2023-11-27	MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers	link
111	2023-12-11	Honeybee: Locality-enhanced Projector for Multimodal LLM	link
109	2023-12-06	OneLLM: One Framework to Align All Modalities with Language	link
107	2023-11-27	GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions	link
106	2023-12-04	GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians	link
106	2023-11-17	Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis	link
106	2023-12-12	COLMAP-Free 3D Gaussian Splatting	link
106	2024-06-16	Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling	link
106	2023-11-14	UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs	link
104	2024-02-27	VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction	link
104	2023-12-14	3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting	link
104	2023-06-08	Grounded Text-to-Image Synthesis with Attention Refocusing	link
103	2023-11-29	4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling	link
103	2023-03-21	CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation	link
103	2023-03-16	HIVE: Harnessing Human Feedback for Instructional Visual Editing	link
101	2023-12-04	GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians	link
101	2023-11-27	UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition	link
100	2023-12-08	Reconstructing Hands in 3D with Transformers	link
100	2024-03-11	DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization	link
98	2023-11-23	SinSR: Diffusion-Based Image Super-Resolution in a Single Step	link
98	2023-12-26	DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision	link
98	2023-12-05	GauHuman: Articulated Gaussian Splatting from Monocular Human Videos	link
97	2023-10-19	Putting the Object Back into Video Object Segmentation	link
96	2024-06-16	OpenEQA: Embodied Question Answering in the Era of Foundation Models	link
96	2023-11-26	GS-IR: 3D Gaussian Splatting for Inverse Rendering	link
94	2024-03-29	Rewrite the Stars	link
93	2023-09-07	InstructDiffusion: A Generalist Modeling Interface for Vision Tasks	link
91	2023-12-06	Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle	link
91	2023-05-24	RoMa: Robust Dense Feature Matching	link
91	2023-12-04	GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis	link
91	2023-11-29	HUGS: Human Gaussian Splats	link
90	2023-11-18	Make Pixels Dance: High-Dynamic Video Generation	link
89	2023-12-07	DreamVideo: Composing Your Dream Videos with Customized Subject and Motion	link
89	2023-11-22	Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model	link
88	2024-01-08	GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation	link
88	2024-04-08	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	link
87	2023-12-04	StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On	link
85	2023-12-01	ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts	link
84	2024-03-08	SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting	link
84	2024-02-05	InstanceDiffusion: Instance-level Control for Image Generation	link
84	2023-11-30	Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding	link
83	2023-12-06	Alpha-CLIP: A CLIP Model Focusing on Wherever You Want	link
82	2023-11-28	Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras	link
82	2023-12-11	Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer	link
81	2023-08-18	SimDA: Simple Diffusion Adapter for Efficient Video Generation	link
81	2023-08-15	CoDeF: Content Deformation Fields for Temporally Consistent Video Processing	link
80	2024-06-16	Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection	link
80	2024-03-10	Poly Kernel Inception Network for Remote Sensing Detection	link
80	2023-12-04	PixelLM: Pixel Reasoning with Large Multimodal Model	link
80	2023-11-28	Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering	link
80	2023-11-27	Compositional Chain-of-Thought Prompting for Large Multimodal Models	link
79	2024-04-12	Probing the 3D Awareness of Visual Foundation Models	link
79	2023-11-30	LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning	link
78	2023-12-24	ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation	link
77	2024-03-10	MACE: Mass Concept Erasure in Diffusion Models	link
77	2023-12-14	Holodeck: Language Guided Generation of 3D Embodied AI Environments	link
77	2023-12-15	Osprey: Pixel Understanding with Visual Instruction Tuning	link
76	2023-11-12	Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models	link
76	2023-11-28	TransNeXt: Robust Foveal Visual Perception for Vision Transformers	link
75	2023-09-20	RMT: Retentive Networks Meet Vision Transformers	link
75	2023-08-23	Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion	link
74	2023-06-30	DisCo: Disentangled Control for Realistic Human Dance Generation	link
73	2023-06-16	PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation	link
73	2023-11-27	GART: Gaussian Articulated Template Models	link
72	2023-11-28	HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting	link
70	2023-12-05	Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?	link
69	2023-11-28	LEDITS++: Limitless Image Editing using Text-to-Image Models	link
68	2023-06-20	Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation	link
68	2023-12-06	Relightable Gaussian Codec Avatars	link
68	2024-03-18	Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters	link
68	2023-12-12	WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion	link
67	2023-12-15	Rich Human Feedback for Text-to-Image Generation	link
67	2023-11-21	SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction	link
67	2023-11-22	HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data	link
67	2023-11-28	Human Gaussian Splatting: Real-time Rendering of Animatable Avatars	link
66	2023-10-02	HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation	link
66	2023-12-21	Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models	link
65	2023-12-01	VideoBooth: Diffusion-based Video Generation with Image Prompts	link
63	2024-03-19	HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting	link
63	2023-09-14	Generative Image Dynamics	link
63	2024-03-03	3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos	link
63	2023-11-30	BioCLIP: A Vision Foundation Model for the Tree of Life	link
63	2024-02-27	Neural Video Compression with Feature Modulation	link
62	2024-03-20	Multi-Modal Hallucination Control by Visual Information Grounding	link
61	2023-11-30	Diffusion Models Without Attention	link
61	2023-11-27	Optimal Transport Aggregation for Visual Place Recognition	link
61	2023-04-03	RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding	link
61	2023-12-07	Scaling Laws of Synthetic Images for Model Training ... for Now	link
61	2023-11-26	NeuRAD: Neural Rendering for Autonomous Driving	link
61	2023-11-29	MMA-Diffusion: MultiModal Attack on Diffusion Models	link
61	2023-11-27	SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation	link
60	2023-12-15	PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment	link
59	2023-12-12	FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition	link
59	2023-12-26	EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI	link
59	2023-11-27	CG-HOI: Contact-Guided 3D Human-Object Interaction Generation	link
59	2023-09-06	Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields	link
58	2023-12-10	ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering	link
58	2023-12-11	SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models	link
58	2023-11-16	DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback	link
58	2023-12-26	SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation	link
58	2024-02-08	Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents	link
57	2023-12-06	Cache Me if You Can: Accelerating Diffusion Models through Block Caching	link
57	2023-05-25	Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models	link
57	2024-03-14	Generalized Predictive Model for Autonomous Driving	link
57	2023-11-28	Panacea: Panoramic and Controllable Video Generation for Autonomous Driving	link
57	2023-12-07	VGGSfM: Visual Geometry Grounded Deep Structure From Motion	link
57	2024-02-08	MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis	link
56	2024-02-22	Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis	link
56	2024-03-03	Logit Standardization in Knowledge Distillation	link
56	2023-06-27	Symphonize 3D Semantic Scene Completion with Contextual Instance Queries	link
56	2023-12-26	One-dimensional Adapter to Rule Them All: Concepts Diffusion Models and Erasing Applications	link
55	2024-01-10	VLP: Vision Language Planning for Autonomous Driving	link
55	2024-01-31	Binding Touch to Everything: Learning Unified Multimodal Tactile Representations	link
55	2023-11-19	Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection	link
55	2023-12-08	SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation	link
55	2024-06-16	SEED-Bench: Benchmarking Multimodal Large Language Models	link
54	2023-12-11	PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization	link
54	2023-10-31	CapsFusion: Rethinking Image-Text Data at Scale	link
54	2023-12-15	GSVA: Generalized Segmentation via Multimodal Large Language Models	link
54	2023-10-17	4K4D: Real-Time 4D View Synthesis at 4K Resolution	link
53	2023-12-01	VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models	link
53	2023-11-27	Self-correcting LLM-controlled Diffusion Models	link
53	2023-07-14	NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis	link
53	2024-04-05	SpatialTracker: Tracking Any 2D Pixels in 3D Space	link
52	2024-01-18	OMG-Seg: Is One Model Good Enough For All Segmentation?	link
51	2024-03-18	Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning	link
51	2024-03-13	Scaling Up Dynamic Human-Scene Interaction Modeling	link
51	2023-12-06	WonderJourney: Going from Anywhere to Everywhere	link
51	2023-12-12	Hallucination Augmented Contrastive Learning for Multimodal Large Language Model	link
51	2024-02-14	OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM	link
51	2024-03-06	Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing	link
51	2023-12-17	Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance	link
50	2024-02-27	Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners	link
50	2023-11-20	BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning	link
50	2023-11-23	GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence	link
50	2023-11-28	Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer	link
49	2024-06-16	VideoLLM-online: Online Video Large Language Model for Streaming Video	link
49	2024-03-06	Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation	link
49	2024-01-11	Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications	link
49	2024-03-11	FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization	link
48	2023-11-29	SODA: Bottleneck Diffusion Models for Representation Learning	link
48	2023-12-07	Free3D: Consistent Novel View Synthesis without 3D Representation	link
47	2023-12-06	HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting	link
47	2023-11-28	A Unified Approach for Text- and Image-guided 4D Scene Generation	link
47	2023-12-06	XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies	link
47	2024-01-17	GARField: Group Anything with Radiance Fields	link
47	2024-04-09	3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis	link
47	2023-12-18	Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering	link
47	2023-11-24	DemoFusion: Democratising High-Resolution Image Generation With No $$$	link
47	2023-12-12	VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens	link
47	2024-03-09	RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection	link
46	2023-08-18	Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training	link
46	2024-03-19	Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection	link
46	2023-09-04	Can I Trust Your Answer? Visually Grounded Video Question Answering	link
45	2024-03-12	ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions	link
45	2023-12-12	EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection	link
45	2023-12-07	RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models	link
45	2024-04-30	XFeat: Accelerated Features for Lightweight Image Matching	link
45	2024-02-04	DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing	link
45	2023-11-30	GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs	link
45	2023-11-27	CoSeR: Bridging Image and Language for Cognitive Super-Resolution	link
45	2023-12-06	MMM: Generative Masked Motion Model	link
44	2023-11-30	CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation	link
44	2023-12-18	GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning	link
44	2024-02-20	Video ReCap: Recursive Captioning of Hour-Long Videos	link
44	2023-10-12	UniPAD: A Universal Pre-training Paradigm for Autonomous Driving	link
44	2023-11-18	SNI-SLAM: Semantic Neural Implicit SLAM	link
44	2024-02-15	GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering	link
43	2023-11-30	Fast ODE-based Sampling for Diffusion Models in Around 5 Steps	link
43	2023-11-28	MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training	link
43	2023-11-28	Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following	link
43	2024-01-03	Instruct-Imagen: Image Generation with Multi-modal Instruction	link
42	2024-02-06	EscherNet: A Generative Model for Scalable View Synthesis	link
42	2023-12-27	Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection	link
42	2023-11-20	LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge	link
42	2023-11-27	SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation	link
42	2023-12-10	AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One	link
42	2023-09-28	CCEdit: Creative and Controllable Video Editing via Diffusion Models	link
42	2023-05-19	Equivariant Multi-Modality Image Fusion	link
42	2023-12-19	InstructVideo: Instructing Video Diffusion Models with Human Feedback	link
42	2023-12-07	LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs	link
42	2024-04-15	PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI	link
41	2024-03-01	Rethinking Inductive Biases for Surface Normal Estimation	link
41	2023-12-04	Towards Learning a Generalist Model for Embodied Navigation	link
41	2023-09-26	Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline	link
41	2024-02-29	DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models	link
41	2024-04-10	Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic	link
41	2024-03-01	Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching	link
41	2023-11-24	OneFormer3D: One Transformer for Unified Point Cloud Segmentation	link
41	2023-12-07	Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation	link
41	2024-03-07	Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed	link
41	2023-12-14	A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions	link
41	2023-04-13	Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction	link
40	2023-12-02	Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction	link
40	2023-10-27	ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image	link
40	2023-12-17	VidToMe: Video Token Merging for Zero-Shot Video Editing	link
40	2023-12-29	Visual Point Cloud Forecasting enables Scalable Autonomous Driving	link
40	2024-02-29	Towards Generalizable Tumor Synthesis	link
40	2024-02-15	DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization	link
40	2024-04-07	Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models	link
40	2023-11-28	Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration	link
39	2024-03-04	ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models	link
39	2024-02-26	GROUNDHOG: Grounding Large Language Models to Holistic Segmentation	link
39	2023-12-14	General Object Foundation Model for Images and Videos at Scale	link
39	2023-12-01	Grounding Everything: Emerging Localization Properties in Vision-Language Transformers	link
39	2024-03-30	InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning	link
39	2023-12-12	MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception	link
39	2023-03-24	DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis	link
39	2023-11-29	SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis	link
39	2023-06-01	Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models	link
38	2024-01-02	Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models	link
38	2023-12-19	Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation	link
38	2024-04-08	LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding	link
38	2023-12-06	On the Robustness of Large Multimodal Models Against Image Adversarial Attacks	link
38	2023-12-03	ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models	link
38	2023-11-29	Gaussian Shell Maps for Efficient 3D Human Generation	link
38	2024-03-14	OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning	link
38	2023-11-30	ChatPose: Chatting about 3D Human Pose	link
38	2023-06-27	Detector-Free Structure from Motion	link
38	2023-12-07	GenTron: Diffusion Transformers for Image and Video Generation	link
38	2023-10-01	Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs	link
37	2023-09-01	CityDreamer: Compositional Generative Model of Unbounded 3D Cities	link
37	2023-12-17	SAI3D: Segment Any Instance in 3D Scenes	link
37	2024-04-15	SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction	link
37	2023-12-07	Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation	link
37	2023-12-14	Mosaic-SDF for 3D Generative Models	link
37	2023-12-18	SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution	link
37	2024-02-23	State Space Models for Event Cameras	link
37	2023-12-06	On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm	link
37	2023-12-04	COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction	link
37	2024-03-25	Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion	link
37	2023-11-28	Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence	link
37	2024-01-03	From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations	link
37	2023-12-19	Optimizing Diffusion Noise Can Serve As Universal Motion Priors	link
37	2023-11-27	TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models	link
37	2023-12-04	VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence	link
37	2024-03-18	MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception	link
37	2023-11-28	LLaFS: When Large Language Models Meet Few-Shot Segmentation	link
37	2024-03-24	EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World	link
37	2023-12-11	Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution	link
36	2023-11-26	BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP	link
36	2024-01-17	Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior	link
36	2023-12-12	DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing	link
36	2023-08-20	Boosting Adversarial Transferability by Block Shuffle and Rotation	link
36	2024-03-15	IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation	link
36	2024-04-08	PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection	link
36	2024-03-26	Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance	link
35	2024-03-08	Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery	link
35	2023-12-05	GPT4Point: A Unified Framework for Point-Language Understanding and Generation	link
35	2024-02-07	SPAD: Spatially Aware Multi-View Diffusers	link
35	2023-12-03	FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding	link
35	2023-12-05	Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation	link
35	2024-03-25	RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection	link
35	2024-04-05	Koala: Key Frame-Conditioned Long Video-LLM	link
35	2024-02-27	Preserving Fairness Generalization in Deepfake Detection	link
35	2023-12-07	Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos	link
35	2024-03-05	PromptKD: Unsupervised Prompt Distillation for Vision-Language Models	link
35	2024-03-11	Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts	link
35	2023-11-27	Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion	link
35	2023-12-10	SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction	link
34	2023-11-28	Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now	link
34	2024-01-17	Vlogger: Make Your Dream A Vlog	link
34	2024-03-04	RegionGPT: Towards Region Understanding Vision Language Model	link
34	2023-08-19	Noisy-Correspondence Learning for Text-to-Image Person Re-identification	link
34	2024-03-02	TUMTraf V2X Cooperative Perception Dataset	link
34	2023-11-28	Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld	link
33	2023-12-04	Aligning and Prompting Everything All at Once for Universal Visual Perception	link
33	2024-06-16	SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes	link
33	2023-12-06	AVID: Any-Length Video Inpainting with Diffusion Model	link
33	2023-12-11	EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion	link
33	2023-11-27	Single-Model and Any-Modality for Video Object Tracking	link
33	2023-09-06	Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation	link
33	2024-01-23	The Neglected Tails in Vision-Language Models	link
33	2024-05-11	EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation	link
33	2024-06-16	PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving	link
32	2024-04-01	Streaming Dense Video Captioning	link
32	2023-06-20	CrossKD: Cross-Head Knowledge Distillation for Object Detection	link
32	2023-12-01	Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion	link
32	2023-08-26	Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs	link
32	2023-12-02	Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D	link
32	2023-08-25	Residual Denoising Diffusion Models	link
32	2024-01-16	MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World	link
32	2023-09-29	Text-Image Alignment for Diffusion-Based Perception	link
32	2023-11-28	AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond	link
32	2024-04-01	Video Interpolation with Diffusion Models	link
32	2024-03-05	SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection	link
32	2023-12-29	FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis	link
32	2023-12-14	Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft	link
32	2024-03-15	Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers	link
32	2023-12-12	CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor	link
31	2024-05-22	ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles	link
31	2024-01-25	pix2gestalt: Amodal Segmentation by Synthesizing Wholes	link
31	2024-02-27	VRP-SAM: SAM with Visual Reference Prompt	link
31	2023-11-22	Visual In-Context Prompting	link
31	2023-01-26	Discovering and Mitigating Visual Biases through Keyword Explanation	link
31	2023-11-28	Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation	link
31	2023-12-08	ControlRoom3D: Room Generation using Semantic Proxy Rooms	link
31	2023-11-25	VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning	link
31	2023-12-13	Towards Text-guided 3D Scene Composition	link
31	2024-01-31	AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error	link
31	2023-12-07	NeRFiller: Completing Scenes via Generative 3D Inpainting	link
31	2023-11-27	SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion	link
31	2024-05-29	NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild	link
31	2024-04-11	OpenBias: Open-set Bias Detection in Text-to-Image Generative Models	link
31	2024-03-11	DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations	link
31	2024-03-22	IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection	link
30	2024-01-17	TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion	link
30	2023-11-30	MotionEditor: Editing Video Motion via Content-Aware Diffusion	link
30	2024-02-08	Driving Everywhere with Large Language Model Policy Adaptation	link
30	2023-11-18	Structure-Aware Sparse-View X-ray 3D Reconstruction	link
30	2023-12-28	ZONE: Zero-Shot Instruction-Guided Local Editing	link
30	2024-03-29	FairCLIP: Harnessing Fairness in Vision-Language Learning	link
30	2023-12-12	PEEKABOO: Interactive Video Generation via Masked-Diffusion	link
30	2023-12-05	Describing Differences in Image Sets with Natural Language	link
30	2023-12-15	Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation	link
30	2023-12-27	SVGDreamer: Text Guided SVG Generation with Diffusion Model	link
30	2023-11-11	PerceptionGPT: Effectively Fusing Visual Perception into LLM	link
30	2023-11-03	HIPTrack: Visual Tracking with Historical Prompts	link
29	2024-01-29	SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design	link
29	2024-04-05	3D Facial Expressions through Analysis-by-Neural-Synthesis	link
29	2023-11-29	Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models	link
29	2024-04-29	An Aggregation-Free Federated Learning for Tackling Data Heterogeneity	link
29	2024-05-21	OmniGlue: Generalizable Feature Matching with Foundation Model Guidance	link
29	2023-11-20	OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning	link
29	2023-11-21	Breathing Life Into Sketches Using Text-to-Video Priors	link
29	2023-12-19	Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model	link
29	2024-01-15	MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation	link
29	2024-01-01	Retrieval-Augmented Egocentric Video Captioning	link
29	2023-12-13	FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models	link
29	2023-12-05	Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models	link
29	2023-12-31	EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling	link
29	2023-08-22	MatFuse: Controllable Material Generation with Diffusion Models	link
29	2024-02-29	CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition	link
29	2024-02-14	Loopy-SLAM: Dense Neural SLAM with Loop Closures	link
28	2023-12-07	Open-Vocabulary Segmentation with Semantic-Assisted Calibration	link
28	2024-04-11	GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh	link
28	2024-06-16	Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration	link
28	2023-11-29	MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning	link
28	2023-12-07	Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models	link
28	2023-05-17	One-Prompt to Segment All Medical Images	link
28	2024-02-23	Seamless Human Motion Composition with Blended Positional Encodings	link
28	2023-12-18	SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing	link
28	2024-03-15	Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives	link
28	2024-01-09	DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation	link
28	2023-11-30	TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model	link
28	2023-12-05	FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions	link
28	2023-09-12	Language Models as Black-Box Optimizers for Vision-Language Models	link
28	2023-06-07	WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models	link
28	2024-05-19	Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology	link
28	2023-12-01	Dense Optical Tracking: Connecting the Dots	link
28	2023-09-27	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	link
28	2023-11-28	SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors	link
28	2023-11-30	HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video	link
27	2024-04-05	DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models	link
27	2023-12-21	PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models	link
27	2023-11-23	Posterior Distillation Sampling	link
27	2023-11-30	DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars	link
27	2024-03-27	Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding	link
27	2024-04-25	TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation	link
27	2024-03-17	Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model	link
27	2023-12-20	A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models	link
27	2023-07-24	CLIP-KD: An Empirical Study of CLIP Model Distillation	link
27	2023-06-15	Generative Proxemics: A Prior for 3D Social Interaction from Images	link
27	2024-03-21	SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks	link
27	2023-12-31	Taming Mode Collapse in Score Distillation for Text-to-3D Generation	link
27	2023-11-30	Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing	link
27	2023-09-14	DePT: Decoupled Prompt Tuning	link
27	2024-01-16	Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary	link
26	2024-03-03	Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis	link
26	2023-05-27	Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers	link
26	2024-04-04	WorDepth: Variational Language Prior for Monocular Depth Estimation	link
26	2023-12-25	A Recipe for Scaling up Text-to-Video Generation with Text-free Videos	link
26	2023-10-10	MuseChat: A Conversational Music Recommendation System for Videos	link
26	2023-12-26	Inter-X: Towards Versatile Human-Human Interaction Analysis	link
26	2024-04-16	Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology	link
26	2024-05-07	DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving	link
26	2023-12-05	Orthogonal Adaptation for Modular Customization of Diffusion Models	link
26	2024-03-27	ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation	link
26	2024-03-19	FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation	link
26	2023-12-11	CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models	link
26	2023-11-26	Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding	link
26	2024-03-28	OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition	link
25	2023-11-27	Text2Loc: 3D Point Cloud Localization from Natural Language	link
25	2023-06-26	ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks	link
25	2024-04-01	LLMs are Good Sign Language Translators	link
25	2023-06-07	CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation	link
25	2024-03-19	Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images	link
25	2024-03-12	PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution	link
25	2023-03-23	NOPE: Novel Object Pose Estimation from a Single Image	link
25	2023-12-07	MuRF: Multi-Baseline Radiance Fields	link
25	2024-02-19	Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships	link
25	2024-04-01	Towards Memorization-Free Diffusion Models	link
25	2024-03-21	CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing	link
25	2023-12-28	Improving Image Restoration through Removing Degradations in Textual Representations	link
25	2024-04-04	MonoCD: Monocular 3D Object Detection with Complementary Depths	link
25	2024-03-31	Towards Realistic Scene Generation with LiDAR Diffusion Models	link
25	2024-02-27	VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis	link
25	2023-12-28	ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe	link
25	2024-03-12	Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis	link
25	2024-03-21	Volumetric Environment Representation for Vision-Language Navigation	link
25	2023-05-25	Learning Occupancy for Monocular 3D Object Detection	link
25	2024-01-24	LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection	link
25	2024-03-15	RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception	link
25	2024-03-04	HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances	link
25	2023-11-23	PointOBB: Learning Oriented Object Detection via Single Point Supervision	link
25	2023-11-28	End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames	link
25	2023-12-11	DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior	link
25	2024-04-06	InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization	link
24	2023-12-04	ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation	link
24	2024-04-02	ViTamin: Designing Scalable Vision Models in the Vision-Language Era	link
24	2024-04-11	MindBridge: A Cross-Subject Brain Decoding Framework	link
24	2023-11-20	DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation	link
24	2024-05-19	Transcriptomics-guided Slide Representation Learning in Computational Pathology	link
24	2023-12-21	VCoder: Versatile Vision Encoders for Multimodal Large Language Models	link
24	2023-12-12	A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames	link
24	2024-03-11	Exploiting Style Latent Flows for Generalizing Deepfake Video Detection	link
24	2023-05-19	DAP: A Dynamic Adversarial Patch for Evading Person Detectors	link
24	2023-12-05	Alchemist: Parametric Control of Material Properties with Diffusion Models	link
24	2023-11-30	ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation	link
24	2023-11-18	SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation	link
24	2024-03-24	Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement	link
24	2024-01-03	A Vision Check-up for Language Models	link
24	2024-02-28	Polos: Multimodal Metric Learning from Human Feedback for Image Captioning	link
24	2023-12-07	Digital Life Project: Autonomous 3D Characters with Social Intelligence	link
24	2024-03-25	SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer	link
24	2024-04-10	HRVDA: High-Resolution Visual Document Assistant	link
24	2024-03-28	Infrared Small Target Detection with Scale and Location Sensitivity	link
23	2023-03-25	UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes	link
23	2023-12-11	Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior	link
23	2023-11-22	PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF	link
23	2023-12-11	Grounded Question-Answering in Long Egocentric Videos	link
23	2024-01-04	Learning the 3D Fauna of the Web	link
23	2024-04-02	Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining	link
23	2023-12-05	C3: High-Performance and Low-Complexity Neural Compression from a Single Image or Video	link
23	2024-04-29	EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars	link
23	2024-04-09	GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation	link
23	2024-04-04	Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation	link
23	2023-12-26	HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D	link
23	2024-03-17	Bilateral Propagation Network for Depth Completion	link
23	2023-11-27	EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension	link
23	2023-12-14	LEMON: Learning 3D Human-Object Interaction Relation from 2D Images	link
23	2023-12-14	DiffusionLight: Light Probes for Free by Painting a Chrome Ball	link
23	2023-05-31	Control4D: Efficient 4D Portrait Editing with Text	link
23	2024-06-16	MMA: Multi-Modal Adapter for Vision-Language Models	link
22	2023-11-25	Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network	link
22	2023-11-29	Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications	link
22	2024-06-16	MultiDiff: Consistent Novel View Synthesis from a Single Image	link
22	2024-06-16	Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space	link
22	2024-02-01	CapHuman: Capture Your Moments in Parallel Universes	link
22	2024-06-06	Matching Anything by Segmenting Anything	link
22	2024-04-04	MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation	link
22	2024-05-03	On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning?	link
22	2024-03-28	Mitigating Motion Blur in Neural Radiance Fields with Events and Frames	link
22	2024-04-10	MoCha-Stereo: Motif Channel Attention Network for Stereo Matching	link
22	2023-12-04	PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation	link
22	2023-11-30	Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?	link
22	2023-12-20	Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis	link
22	2024-02-27	Accelerating Diffusion Sampling with Optimized Time Steps	link
22	2024-02-28	Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis	link
22	2024-03-28	Test-Time Domain Generalization for Face Anti-Spoofing	link
22	2023-12-21	ZeroShape: Regression-based Zero-shot Shape Reconstruction	link
22	2024-04-01	Bridging Remote Sensors with Multisensor Geospatial Foundation Models	link
22	2024-02-29	SeD: Semantic-Aware Discriminator for Image Super-Resolution	link
22	2024-01-16	TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding	link
22	2024-04-01	Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation	link
22	2023-12-12	RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation	link
22	2024-03-24	SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking	link
22	2023-11-29	One-Shot Open Affordance Learning with Foundation Models	link
22	2024-03-22	Tri-Perspective View Decomposition for Geometry-Aware Depth Completion	link
22	2024-03-25	TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models	link
22	2024-03-08	Frequency-Adaptive Dilated Convolution for Semantic Segmentation	link
22	2024-03-01	RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization	link
22	2024-06-16	MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding	link
21	2023-11-30	On Exact Inversion of DPM-Solvers	link
21	2024-03-19	Task-Customized Mixture of Adapters for General Image Fusion	link
21	2024-03-19	Zero-Reference Low-Light Enhancement via Physical Quadruple Priors	link
21	2024-03-26	Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models	link
21	2023-12-20	SpecNeRF: Gaussian Directional Encoding for Specular Reflections	link
21	2024-06-05	AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection	link
21	2023-11-22	Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer	link
21	2023-12-18	CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update	link
21	2023-11-29	Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching	link
21	2023-12-04	Readout Guidance: Learning Control from Diffusion Features	link
21	2023-12-06	Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation	link
21	2023-12-07	Prompt Highlighter: Interactive Control for Multi-Modal LLMs	link
21	2023-11-13	Open-Vocabulary Video Anomaly Detection	link
21	2023-12-28	Amodal Ground Truth and Completion in the Wild	link
21	2023-11-30	Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data	link
21	2024-03-20	DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception	link
21	2024-02-12	Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles	link
21	2023-11-17	High-fidelity Person-centric Subject-to-Image Synthesis	link
21	2024-03-04	One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models	link
21	2024-03-15	Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification	link
21	2024-03-15	LightIt: Illumination Modeling and Control for Diffusion Models	link
21	2024-03-19	AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents	link
21	2023-11-27	Source-Free Domain Adaptation with Frozen Multimodal Foundation Model	link