2249 |
2023-10-05 |
Improved Baselines with Visual Instruction Tuning |
link |
|
674 |
2023-11-27 |
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI |
link |
|
674 |
2023-04-17 |
DETRs Beat YOLOs on Real-time Object Detection |
link |
|
606 |
2024-01-19 |
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data |
link |
|
512 |
2023-10-12 |
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering |
link |
|
371 |
2023-10-23 |
Wonder3D: Single Image to 3D using Cross-Domain Diffusion |
link |
|
360 |
2023-08-01 |
LISA: Reasoning Segmentation via Large Language Model |
link |
|
358 |
2023-11-07 |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration |
link |
|
356 |
2023-11-28 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
link |
|
339 |
2023-09-22 |
Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction |
link |
|
314 |
2023-12-12 |
VILA: On Pre-training for Visual Language Models |
link |
|
311 |
2023-11-29 |
VBench: Comprehensive Benchmark Suite for Video Generative Models |
link |
|
300 |
2023-11-21 |
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering |
link |
|
299 |
2023-11-28 |
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation |
link |
|
297 |
2023-11-27 |
Mip-Splatting: Alias-free 3D Gaussian Splatting |
link |
|
283 |
2023-12-14 |
CogAgent: A Visual Language Model for GUI Agents |
link |
|
278 |
2023-12-21 |
DUSt3R: Geometric 3D Vision Made Easy |
link |
|
260 |
2023-04-06 |
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning |
link |
|
255 |
2024-01-11 |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs |
link |
|
253 |
2024-01-17 |
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models |
link |
|
234 |
2023-07-18 |
AnyDoor: Zero-shot Object-level Image Customization |
link |
|
233 |
2023-07-31 |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding |
link |
|
233 |
2023-12-04 |
SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM |
link |
|
233 |
2023-11-30 |
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering |
link |
|
224 |
2023-11-11 |
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models |
link |
|
220 |
2023-12-20 |
Generative Multimodal Models are In-Context Learners |
link |
|
219 |
2023-12-11 |
Gaussian Splatting SLAM |
link |
|
216 |
2024-01-30 |
YOLO-World: Real-Time Open-Vocabulary Object Detection |
link |
|
212 |
2023-12-19 |
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction |
link |
|
210 |
2023-09-28 |
Text-to-3D using Gaussian Splatting |
link |
|
199 |
2023-11-30 |
One-step Diffusion with Distribution Matching Distillation |
link |
|
194 |
2023-11-21 |
Diffusion Model Alignment Using Direct Preference Optimization |
link |
|
192 |
2023-03-08 |
Video-P2P: Video Editing with Cross-attention Control |
link |
|
189 |
2023-11-14 |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
link |
|
183 |
2023-06-26 |
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing |
link |
|
182 |
2023-11-28 |
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding |
link |
|
182 |
2023-12-15 |
Point Transformer V3: Simpler, Faster, Stronger |
link |
|
177 |
2023-12-07 |
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding |
link |
|
177 |
2023-11-06 |
GLaMM: Pixel Grounding Large Multimodal Model |
link |
|
176 |
2023-11-20 |
GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting |
link |
|
171 |
2023-11-14 |
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion |
link |
|
169 |
2023-07-13 |
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models |
link |
|
168 |
2023-11-19 |
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching |
link |
|
166 |
2023-11-27 |
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model |
link |
|
165 |
2024-01-22 |
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities |
link |
|
164 |
2023-12-26 |
LangSplat: 3D Language Gaussian Splatting |
link |
|
164 |
2023-12-01 |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback |
link |
|
163 |
2023-12-13 |
DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes |
link |
|
163 |
2023-11-24 |
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting |
link |
|
161 |
2023-12-20 |
Splatter Image: Ultra-Fast Single-View 3D Reconstruction |
link |
|
159 |
2023-12-14 |
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers |
link |
|
159 |
2023-11-20 |
PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics |
link |
|
159 |
2024-02-29 |
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers |
link |
|
155 |
2023-12-04 |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding |
link |
|
154 |
2023-11-29 |
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation |
link |
|
154 |
2023-11-22 |
Compact 3D Gaussian Representation for Radiance Field |
link |
|
153 |
2023-12-05 |
ReconFusion: 3D Reconstruction with Diffusion Priors |
link |
|
149 |
2023-07-18 |
RepViT: Revisiting Mobile CNN From ViT Perspective |
link |
|
147 |
2023-12-06 |
Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields |
link |
|
142 |
2023-12-13 |
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects |
link |
|
142 |
2023-12-04 |
SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes |
link |
|
141 |
2023-12-01 |
Sequential Modeling Enables Scalable Learning for Large Vision Models |
link |
|
139 |
2023-11-30 |
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives |
link |
|
136 |
2023-10-23 |
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models |
link |
|
136 |
2023-12-05 |
Analyzing and Improving the Training Dynamics of Diffusion Models |
link |
|
136 |
2023-12-28 |
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action |
link |
|
129 |
2023-04-12 |
An Edit Friendly DDPM Noise Space: Inversion and Manipulations |
link |
|
128 |
2023-04-03 |
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models |
link |
|
128 |
2023-12-28 |
Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis |
link |
|
126 |
2023-12-04 |
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation |
link |
|
122 |
2023-12-01 |
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything |
link |
|
120 |
2023-11-10 |
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks |
link |
|
118 |
2023-09-20 |
FreeU: Free Lunch in Diffusion U-Net |
link |
|
118 |
2023-11-29 |
GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces |
link |
|
116 |
2023-10-17 |
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models |
link |
|
115 |
2023-11-24 |
GeoChat: Grounded Large Vision-Language Model for Remote Sensing |
link |
|
114 |
2024-01-24 |
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild |
link |
|
114 |
2023-11-30 |
Rethinking FID: Towards a Better Evaluation Metric for Image Generation |
link |
|
112 |
2023-12-01 |
DeepCache: Accelerating Diffusion Models for Free |
link |
|
112 |
2023-11-16 |
Emu Edit: Precise Image Editing via Recognition and Generation Tasks |
link |
|
111 |
2023-10-12 |
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models |
link |
|
110 |
2023-11-27 |
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution |
link |
|
110 |
2024-03-27 |
UniDepth: Universal Monocular Metric Depth Estimation |
link |
|
108 |
2023-12-04 |
Style Aligned Image Generation via Shared Attention |
link |
|
107 |
2023-11-28 |
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D |
link |
|
107 |
2023-05-14 |
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding |
link |
|
107 |
2023-03-29 |
InceptionNeXt: When Inception Meets ConvNeXt |
link |
|
105 |
2023-11-29 |
MoMask: Generative Masked Modeling of 3D Human Motions |
link |
|
105 |
2023-12-21 |
Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models |
link |
|
105 |
2023-12-21 |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs |
link |
|
105 |
2023-12-11 |
Honeybee: Locality-enhanced Projector for Multimodal LLM |
link |
|
103 |
2023-11-29 |
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving |
link |
|
101 |
2023-12-15 |
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery |
link |
|
99 |
2023-11-29 |
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling |
link |
|
98 |
2023-11-27 |
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers |
link |
|
98 |
2023-11-27 |
GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions |
link |
|
98 |
2023-06-08 |
Grounded Text-to-Image Synthesis with Attention Refocusing |
link |
|
97 |
2023-12-12 |
LMDrive: Closed-Loop End-to-End Driving with Large Language Models |
link |
|
96 |
2023-12-12 |
COLMAP-Free 3D Gaussian Splatting |
link |
|
95 |
2023-11-30 |
VTimeLLM: Empower LLM to Grasp Video Moments |
link |
|
95 |
2024-02-27 |
VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction |
link |
|
95 |
2023-12-06 |
OneLLM: One Framework to Align All Modalities with Language |
link |
|
95 |
2023-11-17 |
Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis |
link |
|
95 |
2023-11-14 |
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs |
link |
|
94 |
2023-12-04 |
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians |
link |
|
93 |
2023-03-21 |
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation |
link |
|
93 |
2024-06-16 |
Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling |
link |
|
90 |
2024-03-11 |
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization |
link |
|
90 |
2023-03-16 |
HIVE: Harnessing Human Feedback for Instructional Visual Editing |
link |
|
89 |
2023-12-14 |
3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting |
link |
|
89 |
2023-12-26 |
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision |
link |
|
89 |
2023-11-26 |
GS-IR: 3D Gaussian Splatting for Inverse Rendering |
link |
|
88 |
2023-12-04 |
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians |
link |
|
88 |
2023-11-27 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition |
link |
|
87 |
2023-10-19 |
Putting the Object Back into Video Object Segmentation |
link |
|
86 |
2023-12-05 |
GauHuman: Articulated Gaussian Splatting from Monocular Human Videos |
link |
|
86 |
2024-06-16 |
OpenEQA: Embodied Question Answering in the Era of Foundation Models |
link |
|
84 |
2023-05-24 |
RoMa: Robust Dense Feature Matching |
link |
|
84 |
2023-09-07 |
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks |
link |
|
84 |
2023-12-07 |
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion |
link |
|
83 |
2023-12-06 |
Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle |
link |
|
82 |
2023-11-23 |
SinSR: Diffusion-Based Image Super-Resolution in a Single Step |
link |
|
82 |
2023-12-04 |
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis |
link |
|
81 |
2023-11-18 |
Make Pixels Dance: High-Dynamic Video Generation |
link |
|
81 |
2023-12-01 |
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts |
link |
|
81 |
2023-12-04 |
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On |
link |
|
80 |
2024-01-08 |
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation |
link |
|
79 |
2023-12-08 |
Reconstructing Hands in 3D with Transformers |
link |
|
79 |
2024-04-08 |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
link |
|
79 |
2023-11-29 |
HUGS: Human Gaussian Splats |
link |
|
78 |
2023-11-28 |
Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras |
link |
|
78 |
2023-11-30 |
Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding |
link |
|
76 |
2023-08-15 |
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing |
link |
|
75 |
2023-08-18 |
SimDA: Simple Diffusion Adapter for Efficient Video Generation |
link |
|
74 |
2024-03-29 |
Rewrite the Stars |
link |
|
74 |
2023-11-22 |
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model |
link |
|
74 |
2023-11-27 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models |
link |
|
73 |
2023-11-30 |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning |
link |
|
73 |
2023-12-11 |
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer |
link |
|
73 |
2023-11-28 |
Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering |
link |
|
72 |
2023-12-24 |
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation |
link |
|
72 |
2023-12-04 |
PixelLM: Pixel Reasoning with Large Multimodal Model |
link |
|
71 |
2024-03-08 |
SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting |
link |
|
71 |
2023-12-15 |
Osprey: Pixel Understanding with Visual Instruction Tuning |
link |
|
70 |
2023-11-28 |
HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting |
link |
|
68 |
2024-04-12 |
Probing the 3D Awareness of Visual Foundation Models |
link |
|
68 |
2023-12-06 |
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want |
link |
|
68 |
2023-12-14 |
Holodeck: Language Guided Generation of 3D Embodied AI Environments |
link |
|
67 |
2024-02-05 |
InstanceDiffusion: Instance-level Control for Image Generation |
link |
|
67 |
2024-06-16 |
Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection |
link |
|
67 |
2023-08-23 |
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion |
link |
|
66 |
2024-03-10 |
MACE: Mass Concept Erasure in Diffusion Models |
link |
|
66 |
2023-12-05 |
Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? |
link |
|
66 |
2023-11-12 |
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models |
link |
|
65 |
2023-06-16 |
PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation |
link |
|
64 |
2023-11-28 |
LEDITS++: Limitless Image Editing using Text-to-Image Models |
link |
|
64 |
2023-11-27 |
GART: Gaussian Articulated Template Models |
link |
|
63 |
2024-03-10 |
Poly Kernel Inception Network for Remote Sensing Detection |
link |
|
62 |
2023-11-21 |
SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction |
link |
|
62 |
2023-11-28 |
Human Gaussian Splatting: Real-time Rendering of Animatable Avatars |
link |
|
61 |
2023-12-06 |
Relightable Gaussian Codec Avatars |
link |
|
61 |
2024-03-18 |
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters |
link |
|
61 |
2023-11-22 |
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data |
link |
|
60 |
2023-12-15 |
Rich Human Feedback for Text-to-Image Generation |
link |
|
60 |
2023-06-20 |
Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation |
link |
|
60 |
2023-12-07 |
Scaling Laws of Synthetic Images for Model Training ... for Now |
link |
|
60 |
2023-10-02 |
HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation |
link |
|
60 |
2023-09-20 |
RMT: Retentive Networks Meet Vision Transformers |
link |
|
60 |
2023-11-28 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers |
link |
|
59 |
2023-11-27 |
Optimal Transport Aggregation for Visual Place Recognition |
link |
|
59 |
2024-03-19 |
HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting |
link |
|
59 |
2023-09-14 |
Generative Image Dynamics |
link |
|
59 |
2023-12-01 |
VideoBooth: Diffusion-based Video Generation with Image Prompts |
link |
|
59 |
2023-12-21 |
Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models |
link |
|
58 |
2023-11-30 |
Diffusion Models Without Attention |
link |
|
58 |
2023-04-03 |
RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding |
link |
|
58 |
2023-06-30 |
DisCo: Disentangled Control for Realistic Human Dance Generation |
link |
|
57 |
2024-03-03 |
3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos |
link |
|
57 |
2023-11-26 |
NeuRAD: Neural Rendering for Autonomous Driving |
link |
|
55 |
2023-12-06 |
Cache Me if You Can: Accelerating Diffusion Models through Block Caching |
link |
|
55 |
2023-05-25 |
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models |
link |
|
55 |
2023-12-15 |
PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment |
link |
|
55 |
2023-11-27 |
CG-HOI: Contact-Guided 3D Human-Object Interaction Generation |
link |
|
55 |
2024-02-08 |
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis |
link |
|
55 |
2023-11-27 |
SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation |
link |
|
55 |
2023-09-06 |
Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields |
link |
|
54 |
2024-03-20 |
Multi-Modal Hallucination Control by Visual Information Grounding |
link |
|
53 |
2023-11-16 |
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback |
link |
|
53 |
2024-02-22 |
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis |
link |
|
52 |
2023-12-12 |
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition |
link |
|
52 |
2023-12-01 |
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models |
link |
|
52 |
2024-03-14 |
Generalized Predictive Model for Autonomous Driving |
link |
|
52 |
2023-12-26 |
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation |
link |
|
52 |
2023-06-27 |
Symphonize 3D Semantic Scene Completion with Contextual Instance Queries |
link |
|
51 |
2023-12-11 |
PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization |
link |
|
51 |
2023-12-08 |
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation |
link |
|
51 |
2023-11-30 |
BioCLIP: A Vision Foundation Model for the Tree of Life |
link |
|
50 |
2024-01-10 |
VLP: Vision Language Planning for Autonomous Driving |
link |
|
50 |
2023-12-11 |
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models |
link |
|
50 |
2023-12-12 |
WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion |
link |
|
50 |
2023-07-14 |
NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis |
link |
|
50 |
2023-11-29 |
MMA-Diffusion: MultiModal Attack on Diffusion Models |
link |
|
50 |
2024-02-08 |
Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents |
link |
|
49 |
2023-12-10 |
ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering |
link |
|
49 |
2023-11-28 |
Panacea: Panoramic and Controllable Video Generation for Autonomous Driving |
link |
|
48 |
2024-03-13 |
Scaling Up Dynamic Human-Scene Interaction Modeling |
link |
|
48 |
2023-12-26 |
One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications |
link |
|
47 |
2024-02-27 |
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners |
link |
|
47 |
2023-12-06 |
WonderJourney: Going from Anywhere to Everywhere |
link |
|
47 |
2023-11-27 |
Self-correcting LLM-controlled Diffusion Models |
link |
|
47 |
2023-12-26 |
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI |
link |
|
47 |
2023-12-07 |
Free3D: Consistent Novel View Synthesis without 3D Representation |
link |
|
47 |
2023-10-31 |
CapsFusion: Rethinking Image-Text Data at Scale |
link |
|
47 |
2023-12-15 |
GSVA: Generalized Segmentation via Multimodal Large Language Models |
link |
|
47 |
2023-10-17 |
4K4D: Real-Time 4D View Synthesis at 4K Resolution |
link |
|
46 |
2023-11-28 |
A Unified Approach for Text- and Image-guided 4D Scene Generation |
link |
|
46 |
2024-01-18 |
OMG-Seg: Is One Model Good Enough For All Segmentation? |
link |
|
46 |
2023-12-12 |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model |
link |
|
46 |
2023-12-12 |
VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens |
link |
|
46 |
2024-03-03 |
Logit Standardization in Knowledge Distillation |
link |
|
46 |
2024-04-05 |
SpatialTracker: Tracking Any 2D Pixels in 3D Space |
link |
|
46 |
2023-11-28 |
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer |
link |
|
46 |
2024-02-27 |
Neural Video Compression with Feature Modulation |
link |
|
45 |
2024-03-18 |
Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning |
link |
|
45 |
2023-11-19 |
Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection |
link |
|
45 |
2023-12-18 |
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering |
link |
|
44 |
2024-03-06 |
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing |
link |
|
44 |
2023-12-07 |
VGGSfM: Visual Geometry Grounded Deep Structure From Motion |
link |
|
44 |
2024-06-16 |
SEED-Bench: Benchmarking Multimodal Large Language Models |
link |
|
43 |
2024-01-17 |
GARField: Group Anything with Radiance Fields |
link |
|
43 |
2024-01-31 |
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations |
link |
|
43 |
2023-08-18 |
Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training |
link |
|
43 |
2023-09-04 |
Can I Trust Your Answer? Visually Grounded Video Question Answering |
link |
|
43 |
2023-12-17 |
Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance |
link |
|
43 |
2023-11-27 |
CoSeR: Bridging Image and Language for Cognitive Super-Resolution |
link |
|
43 |
2023-11-20 |
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning |
link |
|
43 |
2023-06-01 |
Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models |
link |
|
42 |
2023-11-18 |
SNI-SLAM: Semantic Neural Implicit SLAM |
link |
|
42 |
2023-11-23 |
GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence |
link |
|
41 |
2023-11-29 |
SODA: Bottleneck Diffusion Models for Representation Learning |
link |
|
41 |
2024-02-20 |
Video ReCap: Recursive Captioning of Hour-Long Videos |
link |
|
41 |
2023-10-12 |
UniPAD: A Universal Pre-training Paradigm for Autonomous Driving |
link |
|
41 |
2023-11-24 |
DemoFusion: Democratising High-Resolution Image Generation With No $$$ |
link |
|
41 |
2024-02-04 |
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing |
link |
|
40 |
2023-12-18 |
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning |
link |
|
40 |
2024-04-09 |
3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis |
link |
|
40 |
2024-04-30 |
XFeat: Accelerated Features for Lightweight Image Matching |
link |
|
40 |
2024-02-14 |
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM |
link |
|
40 |
2023-11-30 |
GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs |
link |
|
40 |
2023-12-07 |
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs |
link |
|
40 |
2024-03-11 |
FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization |
link |
|
40 |
2024-01-03 |
Instruct-Imagen: Image Generation with Multi-modal Instruction |
link |
|
39 |
2023-12-06 |
HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting |
link |
|
39 |
2024-06-16 |
VideoLLM-online: Online Video Large Language Model for Streaming Video |
link |
|
39 |
2023-11-30 |
CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation |
link |
|
39 |
2023-12-07 |
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models |
link |
|
39 |
2024-03-09 |
RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection |
link |
|
39 |
2023-12-06 |
MMM: Generative Masked Motion Model |
link |
|
39 |
2023-04-13 |
Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction |
link |
|
38 |
2024-02-06 |
EscherNet: A Generative Model for Scalable View Synthesis |
link |
|
38 |
2023-12-02 |
Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction |
link |
|
38 |
2023-12-04 |
Towards Learning a Generalist Model for Embodied Navigation |
link |
|
38 |
2024-03-06 |
Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation |
link |
|
38 |
2023-11-27 |
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation |
link |
|
38 |
2024-03-12 |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions |
link |
|
38 |
2024-02-15 |
GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering |
link |
|
37 |
2023-10-27 |
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image |
link |
|
37 |
2023-11-30 |
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps |
link |
|
37 |
2024-04-10 |
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic |
link |
|
37 |
2023-09-28 |
CCEdit: Creative and Controllable Video Editing via Diffusion Models |
link |
|
37 |
2024-03-19 |
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection |
link |
|
37 |
2023-11-24 |
OneFormer3D: One Transformer for Unified Point Cloud Segmentation |
link |
|
37 |
2023-11-29 |
Gaussian Shell Maps for Efficient 3D Human Generation |
link |
|
37 |
2023-12-07 |
Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation |
link |
|
37 |
2024-01-11 |
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications |
link |
|
37 |
2024-03-07 |
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed |
link |
|
36 |
2024-01-02 |
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models |
link |
|
36 |
2023-12-27 |
Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection |
link |
|
36 |
2024-03-04 |
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models |
link |
|
36 |
2023-09-26 |
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline |
link |
|
36 |
2024-02-29 |
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models |
link |
|
36 |
2023-12-06 |
XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies |
link |
|
36 |
2024-03-30 |
InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning |
link |
|
36 |
2023-12-12 |
EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection |
link |
|
36 |
2023-12-29 |
Visual Point Cloud Forecasting enables Scalable Autonomous Driving |
link |
|
36 |
2023-12-03 |
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models |
link |
|
36 |
2023-05-19 |
Equivariant Multi-Modality Image Fusion |
link |
|
36 |
2023-12-19 |
InstructVideo: Instructing Video Diffusion Models with Human Feedback |
link |
|
36 |
2023-11-28 |
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following |
link |
|
36 |
2023-12-14 |
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions |
link |
|
36 |
2024-04-15 |
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI |
link |
|
35 |
2023-09-01 |
CityDreamer: Compositional Generative Model of Unbounded 3D Cities |
link |
|
35 |
2023-11-20 |
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge |
link |
|
35 |
2023-12-14 |
General Object Foundation Model for Images and Videos at Scale |
link |
|
35 |
2023-12-14 |
Mosaic-SDF for 3D Generative Models |
link |
|
35 |
2023-12-10 |
AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One |
link |
|
35 |
2023-12-17 |
VidToMe: Video Token Merging for Zero-Shot Video Editing |
link |
|
35 |
2023-12-06 |
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks |
link |
|
35 |
2023-11-28 |
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training |
link |
|
35 |
2024-02-15 |
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization |
link |
|
35 |
2023-12-07 |
GenTron: Diffusion Transformers for Image and Video Generation |
link |
|
35 |
2023-11-28 |
LLaFS: When Large Language Models Meet Few-Shot Segmentation |
link |
|
34 |
2023-12-05 |
GPT4Point: A Unified Framework for Point-Language Understanding and Generation |
link |
|
34 |
2023-12-07 |
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation |
link |
|
34 |
2024-02-26 |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation |
link |
|
34 |
2023-12-19 |
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation |
link |
|
34 |
2023-12-01 |
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers |
link |
|
34 |
2024-03-04 |
RegionGPT: Towards Region Understanding Vision Language Model |
link |
|
34 |
2023-12-06 |
On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm |
link |
|
34 |
2023-11-28 |
Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence |
link |
|
34 |
2024-02-29 |
Towards Generalizable Tumor Synthesis |
link |
|
34 |
2023-12-04 |
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence |
link |
|
34 |
2023-06-27 |
Detector-Free Structure from Motion |
link |
|
34 |
2023-10-01 |
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs |
link |
|
33 |
2024-03-01 |
Rethinking Inductive Biases for Surface Normal Estimation |
link |
|
33 |
2024-06-16 |
SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes |
link |
|
33 |
2023-11-26 |
BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP |
link |
|
33 |
2024-03-01 |
Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching |
link |
|
33 |
2023-12-18 |
SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution |
link |
|
33 |
2024-01-17 |
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior |
link |
|
33 |
2024-04-08 |
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding |
link |
|
33 |
2023-12-12 |
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception |
link |
|
33 |
2023-11-30 |
ChatPose: Chatting about 3D Human Pose |
link |
|
33 |
2023-11-27 |
Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion |
link |
|
33 |
2023-11-28 |
Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration |
link |
|
32 |
2023-12-17 |
SAI3D: Segment Any Instance in 3D Scenes |
link |
|
32 |
2024-04-15 |
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction |
link |
|
32 |
2023-11-28 |
Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now |
link |
|
32 |
2023-12-19 |
Optimizing Diffusion Noise Can Serve As Universal Motion Priors |
link |
|
32 |
2024-04-07 |
Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models |
link |
|
32 |
2023-11-29 |
SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis |
link |
|
32 |
2024-03-18 |
MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception |
link |
|
32 |
2024-03-24 |
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World |
link |
|
31 |
2023-12-01 |
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion |
link |
|
31 |
2024-04-05 |
Koala: Key Frame-Conditioned Long Video-LLM |
link |
|
31 |
2024-02-23 |
State Space Models for Event Cameras |
link |
|
31 |
2023-12-04 |
COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction |
link |
|
31 |
2023-08-20 |
Boosting Adversarial Transferability by Block Shuffle and Rotation |
link |
|
31 |
2023-03-24 |
DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis |
link |
|
31 |
2024-04-01 |
Video Interpolation with Diffusion Models |
link |
|
31 |
2023-09-06 |
Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation |
link |
|
31 |
2024-03-11 |
Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts |
link |
|
31 |
2024-03-26 |
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance |
link |
|
31 |
2023-12-12 |
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor |
link |
|
31 |
2023-12-11 |
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution |
link |
|
30 |
2024-02-07 |
SPAD: Spatially Aware Multi-View Diffusers |
link |
|
30 |
2023-12-06 |
AVID: Any-Length Video Inpainting with Diffusion Model |
link |
|
30 |
2023-09-29 |
Text-Image Alignment for Diffusion-Based Perception |
link |
|
30 |
2024-03-25 |
Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion |
link |
|
30 |
2024-01-03 |
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations |
link |
|
30 |
2023-12-11 |
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion |
link |
|
30 |
2024-03-14 |
OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning |
link |
|
30 |
2023-11-28 |
Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld |
link |
|
30 |
2024-01-23 |
The Neglected Tails in Vision-Language Models |
link |
|
30 |
2023-12-14 |
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft |
link |
|
30 |
2023-12-10 |
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction |
link |
|
30 |
2024-06-16 |
PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving |
link |
|
29 |
2023-12-05 |
Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation |
link |
|
29 |
2024-01-17 |
Vlogger: Make Your Dream A Vlog |
link |
|
29 |
2024-03-25 |
RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection |
link |
|
29 |
2024-02-08 |
Driving Everywhere with Large Language Model Policy Adaptation |
link |
|
29 |
2023-08-19 |
Noisy-Correspondence Learning for Text-to-Image Person Re-identification |
link |
|
29 |
2023-11-27 |
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models |
link |
|
29 |
2023-12-12 |
PEEKABOO: Interactive Video Generation via Masked-Diffusion |
link |
|
29 |
2023-09-27 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
link |
|
29 |
2023-11-28 |
AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond |
link |
|
29 |
2024-03-02 |
TUMTraf V2X Cooperative Perception Dataset |
link |
|
29 |
2023-12-13 |
Towards Text-guided 3D Scene Composition |
link |
|
29 |
2023-11-27 |
SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion |
link |
|
29 |
2024-05-29 |
NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild |
link |
|
29 |
2024-03-22 |
IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection |
link |
|
28 |
2024-01-17 |
TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion |
link |
|
28 |
2023-12-04 |
Aligning and Prompting Everything All at Once for Universal Visual Perception |
link |
|
28 |
2023-12-03 |
FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding |
link |
|
28 |
2023-12-02 |
Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D |
link |
|
28 |
2024-02-27 |
Preserving Fairness Generalization in Deepfake Detection |
link |
|
28 |
2024-01-01 |
Retrieval-Augmented Egocentric Video Captioning |
link |
|
28 |
2023-12-28 |
ZONE: Zero-Shot Instruction-Guided Local Editing |
link |
|
28 |
2024-03-15 |
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation |
link |
|
28 |
2023-12-05 |
Describing Differences in Image Sets with Natural Language |
link |
|
28 |
2024-03-05 |
SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection |
link |
|
28 |
2024-03-05 |
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models |
link |
|
28 |
2023-12-29 |
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis |
link |
|
28 |
2023-12-07 |
NeRFiller: Completing Scenes via Generative 3D Inpainting |
link |
|
28 |
2023-11-11 |
PerceptionGPT: Effectively Fusing Visual Perception into LLM |
link |
|
27 |
2023-12-21 |
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models |
link |
|
27 |
2024-03-08 |
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery |
link |
|
27 |
2023-11-30 |
MotionEditor: Editing Video Motion via Content-Aware Diffusion |
link |
|
27 |
2023-11-20 |
OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning |
link |
|
27 |
2023-12-19 |
Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model |
link |
|
27 |
2023-11-18 |
Structure-Aware Sparse-View X-ray 3D Reconstruction |
link |
|
27 |
2023-12-13 |
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models |
link |
|
27 |
2023-12-18 |
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing |
link |
|
27 |
2024-01-16 |
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World |
link |
|
27 |
2024-01-09 |
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation |
link |
|
27 |
2024-01-25 |
pix2gestalt: Amodal Segmentation by Synthesizing Wholes |
link |
|
27 |
2024-04-08 |
PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection |
link |
|
27 |
2023-06-07 |
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models |
link |
|
27 |
2023-11-22 |
Visual In-Context Prompting |
link |
|
27 |
2023-11-27 |
Single-Model and Any-Modality for Video Object Tracking |
link |
|
27 |
2023-11-28 |
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation |
link |
|
27 |
2023-11-28 |
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors |
link |
|
27 |
2023-12-08 |
ControlRoom3D: Room Generation using Semantic Proxy Rooms |
link |
|
27 |
2024-01-31 |
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error |
link |
|
27 |
2023-11-03 |
HIPTrack: Visual Tracking with Historical Prompts |
link |
|
27 |
2024-03-15 |
Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers |
link |
|
27 |
2024-02-29 |
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition |
link |
|
27 |
2024-02-14 |
Loopy-SLAM: Dense Neural SLAM with Loop Closures |
link |
|
27 |
2024-04-11 |
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models |
link |
|
26 |
2024-04-01 |
Streaming Dense Video Captioning |
link |
|
26 |
2024-01-29 |
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design |
link |
|
26 |
2023-12-07 |
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos |
link |
|
25 |
2023-12-07 |
Open-Vocabulary Segmentation with Semantic-Assisted Calibration |
link |
|
25 |
2023-06-20 |
CrossKD: Cross-Head Knowledge Distillation for Object Detection |
link |
|
25 |
2023-11-29 |
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning |
link |
|
25 |
2023-11-29 |
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models |
link |
|
25 |
2023-12-07 |
MuRF: Multi-Baseline Radiance Fields |
link |
|
25 |
2023-12-07 |
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models |
link |
|
25 |
2024-04-04 |
WorDepth: Variational Language Prior for Monocular Depth Estimation |
link |
|
25 |
2024-03-15 |
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives |
link |
|
25 |
2024-02-27 |
VRP-SAM: SAM with Visual Reference Prompt |
link |
|
25 |
2024-03-17 |
Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model |
link |
|
25 |
2023-09-12 |
Language Models as Black-Box Optimizers for Vision-Language Models |
link |
|
25 |
2023-12-05 |
Orthogonal Adaptation for Modular Customization of Diffusion Models |
link |
|
25 |
2023-12-11 |
CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models |
link |
|
25 |
2023-12-27 |
SVGDreamer: Text Guided SVG Generation with Diffusion Model |
link |
|
25 |
2023-11-25 |
VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning |
link |
|
25 |
2023-12-31 |
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling |
link |
|
25 |
2024-05-11 |
EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation |
link |
|
24 |
2024-04-11 |
GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh |
link |
|
24 |
2024-03-03 |
Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis |
link |
|
24 |
2023-05-27 |
Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers |
link |
|
24 |
2024-02-19 |
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships |
link |
|
24 |
2023-05-17 |
One-Prompt to Segment All Medical Images |
link |
|
24 |
2023-08-25 |
Residual Denoising Diffusion Models |
link |
|
24 |
2023-11-21 |
Breathing Life Into Sketches Using Text-to-Video Priors |
link |
|
24 |
2023-12-26 |
Inter-X: Towards Versatile Human-Human Interaction Analysis |
link |
|
24 |
2024-02-23 |
Seamless Human Motion Composition with Blended Positional Encodings |
link |
|
24 |
2024-03-29 |
FairCLIP: Harnessing Fairness in Vision-Language Learning |
link |
|
24 |
2023-12-12 |
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing |
link |
|
24 |
2024-05-07 |
DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving |
link |
|
24 |
2024-02-28 |
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning |
link |
|
24 |
2023-12-20 |
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models |
link |
|
24 |
2023-11-26 |
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding |
link |
|
24 |
2024-05-19 |
Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology |
link |
|
24 |
2023-01-26 |
Discovering and Mitigating Visual Biases through Keyword Explanation |
link |
|
24 |
2023-12-15 |
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation |
link |
|
24 |
2023-11-30 |
HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video |
link |
|
24 |
2024-03-11 |
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations |
link |
|
23 |
2024-05-22 |
ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles |
link |
|
23 |
2023-11-23 |
Posterior Distillation Sampling |
link |
|
23 |
2023-11-30 |
DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars |
link |
|
23 |
2024-04-05 |
3D Facial Expressions through Analysis-by-Neural-Synthesis |
link |
|
23 |
2024-03-27 |
Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding |
link |
|
23 |
2024-05-21 |
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance |
link |
|
23 |
2023-08-26 |
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs |
link |
|
23 |
2023-12-25 |
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos |
link |
|
23 |
2023-10-10 |
MuseChat: A Conversational Music Recommendation System for Videos |
link |
|
23 |
2024-01-15 |
MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation |
link |
|
23 |
2023-12-28 |
Improving Image Restoration through Removing Degradations in Textual Representations |
link |
|
23 |
2023-12-05 |
FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions |
link |
|
23 |
2024-03-27 |
ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation |
link |
|
23 |
2023-12-05 |
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models |
link |
|
23 |
2023-05-25 |
Learning Occupancy for Monocular 3D Object Detection |
link |
|
23 |
2023-06-15 |
Generative Proxemics: A Prior for 3D Social Interaction from Images |
link |
|
23 |
2023-12-31 |
Taming Mode Collapse in Score Distillation for Text-to-3D Generation |
link |
|
23 |
2023-12-26 |
HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D |
link |
|
23 |
2024-03-04 |
HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances |
link |
|
23 |
2024-01-16 |
Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary |
link |
|
23 |
2024-04-10 |
HRVDA: High-Resolution Visual Document Assistant |
link |
|
22 |
2024-06-16 |
Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration |
link |
|
22 |
2024-04-11 |
MindBridge: A Cross-Subject Brain Decoding Framework |
link |
|
22 |
2023-03-25 |
UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes |
link |
|
22 |
2024-03-12 |
PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution |
link |
|
22 |
2023-03-23 |
NOPE: Novel Object Pose Estimation from a Single Image |
link |
|
22 |
2024-04-29 |
An Aggregation-Free Federated Learning for Tackling Data Heterogeneity |
link |
|
22 |
2024-04-01 |
Towards Memorization-Free Diffusion Models |
link |
|
22 |
2023-12-12 |
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames |
link |
|
22 |
2023-12-11 |
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior |
link |
|
22 |
2023-11-22 |
PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF |
link |
|
22 |
2023-05-19 |
DAP: A Dynamic Adversarial Patch for Evading Person Detectors |
link |
|
22 |
2024-03-21 |
CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing |
link |
|
22 |
2024-04-25 |
TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation |
link |
|
22 |
2023-12-28 |
ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe |
link |
|
22 |
2023-12-07 |
Digital Life Project: Autonomous 3D Characters with Social Intelligence |
link |
|
22 |
2023-12-01 |
Dense Optical Tracking: Connecting the Dots |
link |
|
22 |
2024-01-24 |
LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection |
link |
|
22 |
2024-03-15 |
RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception |
link |
|
22 |
2024-03-21 |
SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks |
link |
|
22 |
2024-04-04 |
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation |
link |
|
22 |
2023-08-22 |
MatFuse: Controllable Material Generation with Diffusion Models |
link |
|
22 |
2023-09-14 |
DePT: Decoupled Prompt Tuning |
link |
|
22 |
2023-11-27 |
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension |
link |
|
22 |
2024-03-28 |
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition |
link |
|
22 |
2023-05-31 |
Control4D: Efficient 4D Portrait Editing with Text |
link |
|
22 |
2023-12-11 |
DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior |
link |
|
22 |
2024-04-06 |
InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization |
link |
|
21 |
2023-11-27 |
Text2Loc: 3D Point Cloud Localization from Natural Language |
link |
|
21 |
2024-04-05 |
DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models |
link |
|
21 |
2023-11-29 |
Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications |
link |
|
21 |
2024-04-01 |
LLMs are Good Sign Language Translators |
link |
|
21 |
2024-04-02 |
ViTamin: Designing Scalable Vision Models in the Vision-Language Era |
link |
|
21 |
2023-06-07 |
CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation |
link |
|
21 |
2024-03-19 |
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images |
link |
|
21 |
2024-05-19 |
Transcriptomics-guided Slide Representation Learning in Computational Pathology |
link |
|
21 |
2024-06-06 |
Matching Anything by Segmenting Anything |
link |
|
21 |
2023-12-11 |
Grounded Question-Answering in Long Egocentric Videos |
link |
|
21 |
2023-11-30 |
ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation |
link |
|
21 |
2024-04-04 |
MonoCD: Monocular 3D Object Detection with Complementary Depths |
link |
|
21 |
2023-11-30 |
TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model |
link |
|
21 |
2024-03-28 |
Test-Time Domain Generalization for Face Anti-Spoofing |
link |
|
21 |
2024-01-03 |
A Vision Check-up for Language Models |
link |
|
21 |
2024-03-19 |
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation |
link |
|
21 |
2024-03-21 |
Volumetric Environment Representation for Vision-Language Navigation |
link |
|
21 |
2024-01-16 |
TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding |
link |
|
21 |
2024-04-29 |
EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars |
link |
|
21 |
2023-11-30 |
Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing |
link |
|
21 |
2023-12-14 |
DiffusionLight: Light Probes for Free by Painting a Chrome Ball |
link |
|
20 |
2023-12-04 |
ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation |
link |
|
20 |
2024-03-26 |
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models |
link |
|
20 |
2023-11-20 |
DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation |
link |
|
20 |
2023-12-18 |
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update |
link |
|
20 |
2024-02-01 |
CapHuman: Capture Your Moments in Parallel Universes |
link |
|
20 |
2023-12-21 |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models |
link |
|
20 |
2022-12-06 |
Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis |
link |
|
20 |
2024-02-08 |
Question Aware Vision Transformer for Multimodal Reasoning |
link |
|
20 |
2024-05-03 |
On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning? |
link |
|
20 |
2024-01-04 |
Learning the 3D Fauna of the Web |
link |
|
20 |
2023-11-13 |
Open-Vocabulary Video Anomaly Detection |
link |
|
20 |
2024-04-16 |
Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology |
link |
|
20 |
2024-02-27 |
Accelerating Diffusion Sampling with Optimized Time Steps |
link |
|
20 |
2024-03-31 |
Towards Realistic Scene Generation with LiDAR Diffusion Models |
link |
|
20 |
2024-02-27 |
VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis |
link |
|
20 |
2023-12-21 |
ZeroShape: Regression-based Zero-shot Shape Reconstruction |
link |
|
20 |
2024-04-09 |
GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation |
link |
|
20 |
2024-03-22 |
Tri-Perspective View Decomposition for Geometry-Aware Depth Completion |
link |
|
20 |
2023-12-09 |
CoGS: Controllable Gaussian Splatting |
link |
|
20 |
2023-11-29 |
GenZI: Zero-Shot 3D Human-Scene Interaction Generation |
link |
|
20 |
2023-11-17 |
Multimodal Representation Learning by Alternating Unimodal Adaptation |
link |
|
19 |
2023-11-30 |
On Exact Inversion of DPM-Solvers |
link |
|
19 |
2024-03-19 |
Task-Customized Mixture of Adapters for General Image Fusion |
link |
|
19 |
2024-05-21 |
Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks |
link |
|
19 |
2024-06-16 |
Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space |
link |
|
19 |
2024-02-27 |
CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention |
link |
|
19 |
2023-11-29 |
Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching |
link |
|
19 |
2024-03-11 |
Exploiting Style Latent Flows for Generalizing Deepfake Video Detection |
link |
|
19 |
2023-12-04 |
Readout Guidance: Learning Control from Diffusion Features |
link |
|
19 |
2024-04-04 |
MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation |
link |
|
19 |
2023-12-05 |
C3: High-Performance and Low-Complexity Neural Compression from a Single Image or Video |
link |
|
19 |
2023-12-20 |
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis |
link |
|
19 |
2023-11-09 |
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities |
link |
|
19 |
2023-11-18 |
SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation |
link |
|
19 |
2023-12-12 |
GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos |
link |
|
19 |
2024-03-24 |
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement |
link |
|
19 |
2024-02-12 |
Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles |
link |
|
19 |
2024-03-12 |
Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis |
link |
|
19 |
2024-01-02 |
Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation |
link |
|
19 |
2023-12-04 |
How to Configure Good In-Context Sequence for Visual Question Answering |
link |
|
19 |
2024-04-01 |
Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation |
link |
|
19 |
2023-07-24 |
CLIP-KD: An Empirical Study of CLIP Model Distillation |
link |
|
19 |
2024-03-15 |
LightIt: Illumination Modeling and Control for Diffusion Models |
link |
|
19 |
2024-03-19 |
AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents |
link |
|
19 |
2024-02-22 |
CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation |
link |
|
19 |
2024-03-25 |
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models |
link |
|
19 |
2023-07-14 |
SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments |
link |
|
19 |
2024-03-17 |
Bilateral Propagation Network for Depth Completion |
link |
|
19 |
2024-06-16 |
MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding |
link |
|
19 |
2024-02-29 |
PEM: Prototype-based Efficient MaskFormer for Image Segmentation |
link |
|
19 |
2024-06-16 |
MMA: Multi-Modal Adapter for Vision-Language Models |
link |
|
19 |
2022-11-15 |
Data Poisoning based Backdoor Attacks to Contrastive Learning |
link |
|
19 |
2024-04-06 |
Diffusion Time-step Curriculum for One Image to 3D Generation |
link |
|
18 |
2023-11-25 |
Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network |
link |
|
18 |
2024-03-04 |
Neural Redshift: Random Networks are not Random Functions |
link |
|
18 |
2023-12-05 |
HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces |
link |
|
18 |
2024-06-16 |
MultiDiff: Consistent Novel View Synthesis from a Single Image |
link |
|
18 |
2024-06-05 |
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection |
link |
|
18 |
2024-03-31 |
Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction |
link |
|
18 |
2023-11-28 |
Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features |
link |
|
18 |
2023-11-30 |
CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model |
link |
|
18 |
2023-12-06 |
Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation |
link |
|
18 |
2024-04-09 |
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering |
link |
|
18 |
2023-11-26 |
Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding |
link |
|
18 |
2024-04-10 |
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching |
link |
|
18 |
2024-04-14 |
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection |
link |
|
18 |
2024-03-26 |
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval |
link |
|
18 |
2023-12-04 |
PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation |
link |
|
18 |
2024-05-07 |
Tactile-Augmented Radiance Fields |
link |
|
18 |
2024-02-28 |
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis |
link |
|
18 |
2024-04-01 |
MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction |
link |
|
18 |
2023-12-06 |
TokenCompose: Text-to-Image Diffusion with Token-level Supervision |
link |
|