1580 |
2023-10-05 |
link |
Improved Baselines with Visual Instruction Tuning |
|
423 |
2023-11-27 |
link |
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI |
|
360 |
2023-04-17 |
link |
DETRs Beat YOLOs on Real-time Object Detection |
|
351 |
2024-01-19 |
link |
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data |
|
346 |
2023-10-12 |
link |
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering |
|
264 |
2023-11-07 |
link |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration |
|
264 |
2023-10-23 |
link |
Wonder3D: Single Image to 3D Using Cross-Domain Diffusion |
|
243 |
2023-08-01 |
link |
LISA: Reasoning Segmentation via Large Language Model |
|
222 |
2023-09-22 |
link |
Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction |
|
204 |
2023-04-06 |
link |
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning |
|
198 |
2023-11-21 |
link |
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering |
|
182 |
2023-11-28 |
link |
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation |
|
181 |
2023-12-12 |
link |
VILA: On Pre-training for Visual Language Models |
|
173 |
2023-11-28 |
link |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
|
170 |
2023-11-27 |
link |
Mip-Splatting: Alias-Free 3D Gaussian Splatting |
|
167 |
2023-07-18 |
link |
AnyDoor: Zero-shot Object-level Image Customization |
|
166 |
2024-01-11 |
link |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs |
|
160 |
2023-11-11 |
link |
Monkey: Image Resolution and Text Label are Important Things for Large Multi-Modal Models |
|
158 |
2023-09-28 |
link |
Text-to-3D using Gaussian Splatting |
|
158 |
2023-12-04 |
link |
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM |
|
157 |
2023-12-20 |
link |
Generative Multimodal Models are In-Context Learners |
|
147 |
2023-03-08 |
link |
Video-P2P: Video Editing with Cross-Attention Control |
|
141 |
2023-12-11 |
link |
Gaussian Splatting SLAM |
|
135 |
2023-07-31 |
link |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding |
|
134 |
2024-01-17 |
link |
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models |
|
133 |
2023-11-29 |
link |
VBench: Comprehensive Benchmark Suite for Video Generative Models |
|
132 |
2023-11-30 |
link |
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering |
|
132 |
2023-07-13 |
link |
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models |
|
125 |
2023-06-26 |
link |
DragDiffusion: Harnessing Diffusion Models for Interactive Point-Based Image Editing |
|
122 |
2024-01-30 |
link |
YOLO-World: Real-Time Open-Vocabulary Object Detection |
|
121 |
2023-11-14 |
link |
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion |
|
120 |
2023-12-19 |
link |
PixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction |
|
119 |
2023-12-21 |
link |
DUSt3R: Geometric 3D Vision Made Easy |
|
119 |
2023-11-20 |
link |
GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting |
|
116 |
2023-11-24 |
link |
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting |
|
115 |
2023-11-30 |
link |
One-Step Diffusion with Distribution Matching Distillation |
|
114 |
2023-11-28 |
link |
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding |
|
113 |
2023-12-14 |
link |
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers |
|
112 |
2023-11-27 |
link |
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model |
|
110 |
2023-04-03 |
link |
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models |
|
108 |
2023-11-21 |
link |
Diffusion Model Alignment Using Direct Preference Optimization |
|
108 |
2023-12-01 |
link |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback |
|
108 |
2023-11-19 |
link |
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching |
|
105 |
2023-11-06 |
link |
GLaMM: Pixel Grounding Large Multimodal Model |
|
105 |
2023-12-01 |
link |
Sequential Modeling Enables Scalable Learning for Large Vision Models |
|
103 |
2023-11-29 |
link |
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation |
|
100 |
2023-11-14 |
link |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
|
99 |
2023-11-22 |
link |
Compact 3D Gaussian Representation for Radiance Field |
|
98 |
2023-11-20 |
link |
PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics |
|
97 |
2023-12-07 |
link |
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding |
|
97 |
2023-12-20 |
link |
Splatter Image: Ultra-Fast Single-View 3D Reconstruction |
|
95 |
2023-12-05 |
link |
ReconFusion: 3D Reconstruction with Diffusion Priors |
|
90 |
2023-12-28 |
link |
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action |
|
89 |
2023-12-13 |
link |
DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes |
|
88 |
2023-12-01 |
link |
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything |
|
87 |
2023-04-12 |
link |
An Edit Friendly DDPM Noise Space: Inversion and Manipulations |
|
86 |
2023-12-04 |
link |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding |
|
85 |
2023-12-26 |
link |
LangSplat: 3D Language Gaussian Splatting |
|
85 |
2023-10-23 |
link |
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models |
|
83 |
2024-01-22 |
link |
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities |
|
83 |
2023-05-14 |
link |
ULIP-2: Towards Scalable Multimodal Pre-Training for 3D Understanding |
|
82 |
2023-11-29 |
link |
GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces |
|
82 |
2023-12-04 |
link |
SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes |
|
82 |
2023-03-30 |
link |
InceptionNeXt: When Inception Meets ConvNeXt |
|
81 |
2023-12-06 |
link |
Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields |
|
79 |
2024-02-29 |
link |
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers |
|
79 |
2023-09-20 |
link |
FreeU: Free Lunch in Diffusion U-Net |
|
78 |
2023-12-21 |
link |
Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models |
|
76 |
2023-12-05 |
link |
Analyzing and Improving the Training Dynamics of Diffusion Models |
|
75 |
2023-06-08 |
link |
Grounded Text-to-Image Synthesis with Attention Refocusing |
|
75 |
2023-12-13 |
link |
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects |
|
74 |
2023-03-16 |
link |
HIVE: Harnessing Human Feedback for Instructional Visual Editing |
|
74 |
2023-12-04 |
link |
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation |
|
74 |
2023-12-04 |
link |
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians |
|
73 |
2023-11-28 |
link |
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D |
|
72 |
2023-10-17 |
link |
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models |
|
72 |
2023-11-16 |
link |
Emu Edit: Precise Image Editing via Recognition and Generation Tasks |
|
68 |
2023-12-01 |
link |
DeepCache: Accelerating Diffusion Models for Free |
|
68 |
2023-12-11 |
link |
Honeybee: Locality-Enhanced Projector for Multimodal LLM |
|
68 |
2023-12-28 |
link |
Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis |
|
67 |
2023-11-14 |
link |
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs |
|
66 |
2024-02-27 |
link |
VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction |
|
65 |
2023-12-04 |
link |
Style Aligned Image Generation via Shared Attention |
|
65 |
2023-11-27 |
link |
GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions |
|
64 |
2023-12-21 |
link |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs |
|
64 |
2023-11-26 |
link |
GS-IR: 3D Gaussian Splatting for Inverse Rendering |
|
63 |
2023-11-29 |
link |
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling |
|
62 |
2023-10-12 |
link |
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models |
|
62 |
2023-08-15 |
link |
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing |
|
62 |
2023-03-21 |
link |
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation |
|
61 |
2023-09-07 |
link |
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks |
|
61 |
2023-12-12 |
link |
LMDrive: Closed-Loop End-to-End Driving with Large Language Models |
|
60 |
2023-12-12 |
link |
COLMAP-Free 3D Gaussian Splatting |
|
60 |
2023-11-29 |
link |
Driving Into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving |
|
60 |
2024-01-24 |
link |
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild |
|
60 |
2023-12-15 |
link |
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery |
|
59 |
2023-11-27 |
link |
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution |
|
58 |
2024-01-08 |
link |
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation |
|
58 |
2023-11-28 |
link |
Photo-SLAM: Real-Time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras |
|
57 |
2023-08-18 |
link |
SimDA: Simple Diffusion Adapter for Efficient Video Generation |
|
56 |
2023-11-17 |
link |
Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis |
|
56 |
2023-11-30 |
link |
Rethinking FID: Towards a Better Evaluation Metric for Image Generation |
|
56 |
2023-12-06 |
link |
OneLLM: One Framework to Align All Modalities with Language |
|
56 |
2023-12-05 |
link |
GauHuman: Articulated Gaussian Splatting from Monocular Human Videos |
|
55 |
2023-12-04 |
link |
GPS-Gaussian: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human Novel View Synthesis |
|
55 |
2023-10-19 |
link |
Putting the Object Back into Video Object Segmentation |
|
55 |
2023-12-04 |
link |
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians |
|
55 |
2023-11-18 |
link |
Make Pixels Dance: High-Dynamic Video Generation |
|
54 |
2023-11-28 |
link |
HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting |
|
54 |
2023-11-29 |
link |
HUGS: Human Gaussian Splats |
|
54 |
2023-12-14 |
link |
3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting |
|
53 |
2023-11-10 |
link |
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks |
|
53 |
2023-12-01 |
link |
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts |
|
53 |
2023-11-28 |
link |
SEED-Bench-2: Benchmarking Multimodal Large Language Models |
|
52 |
2023-11-27 |
link |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition |
|
52 |
2023-11-30 |
link |
VTimeLLM: Empower LLM to Grasp Video Moments |
|
51 |
2023-12-06 |
link |
Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle |
|
50 |
2023-11-27 |
link |
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers |
|
50 |
2023-11-28 |
link |
Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering |
|
50 |
2023-06-16 |
link |
PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation |
|
49 |
2024-06-16 |
link |
OpenEQA: Embodied Question Answering in the Era of Foundation Models |
|
49 |
2024-03-27 |
link |
UniDepth: Universal Monocular Metric Depth Estimation |
|
48 |
2023-11-30 |
link |
Diffusion Models Without Attention |
|
48 |
2024-03-11 |
link |
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization |
|
47 |
2023-11-29 |
link |
MoMask: Generative Masked Modeling of 3D Human Motions |
|
47 |
2023-12-24 |
link |
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation |
|
46 |
2024-04-12 |
link |
Probing the 3D Awareness of Visual Foundation Models |
|
46 |
2023-10-02 |
link |
HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation |
|
45 |
2023-12-07 |
link |
Scaling Laws of Synthetic Images for Model Training … for Now |
|
45 |
2023-12-15 |
link |
Osprey: Pixel Understanding with Visual Instruction Tuning |
|
44 |
2023-12-15 |
link |
PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment |
|
44 |
2023-11-22 |
link |
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model |
|
43 |
2023-12-06 |
link |
Alpha-CLIP: A CLIP Model Focusing on Wherever you Want |
|
43 |
2023-05-25 |
link |
Prompt-Free Diffusion: Taking “Text” Out of Text-to-Image Diffusion Models |
|
43 |
2023-06-20 |
link |
Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation |
|
43 |
2023-11-27 |
link |
Compositional Chain-of-Thought Prompting for Large Multimodal Models |
|
42 |
2023-04-03 |
link |
RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding |
|
42 |
2023-08-23 |
link |
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion |
|
42 |
2023-11-30 |
link |
Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding |
|
42 |
2023-09-06 |
link |
Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields |
|
41 |
2023-11-27 |
link |
GART: Gaussian Articulated Template Models |
|
41 |
None |
link |
EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior |
|
40 |
2023-11-28 |
link |
Human Gaussian Splatting: Real-Time Rendering of Animatable Avatars |
|
40 |
2023-10-17 |
link |
4K4D: Real-Time 4D View Synthesis at 4K Resolution |
|
40 |
2023-09-14 |
link |
Generative Image Dynamics |
|
39 |
2024-03-08 |
link |
SplattingAvatar: Realistic Real-Time Human Avatars With Mesh-Embedded Gaussian Splatting |
|
39 |
2023-11-21 |
link |
SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction |
|
38 |
2023-10-16 |
link |
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation |
|
38 |
2023-12-06 |
link |
Relightable Gaussian Codec Avatars |
|
38 |
2023-07-14 |
link |
NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis |
|
38 |
2023-06-30 |
link |
Disco: Disentangled Control for Realistic Human Dance Generation |
|
38 |
2023-12-08 |
link |
Reconstructing Hands in 3D with Transformers |
|
37 |
2024-02-08 |
link |
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis |
|
36 |
2023-11-22 |
link |
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data |
|
36 |
2023-12-10 |
link |
ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering |
|
36 |
2023-12-14 |
link |
Holodeck: Language Guided Generation of 3D Embodied AI Environments |
|
36 |
2024-02-05 |
link |
InstanceDiffusion: Instance-Level Control for Image Generation |
|
36 |
2023-06-27 |
link |
Symphonize 3D Semantic Scene Completion with Contextual Instance Queries |
|
36 |
2023-11-27 |
link |
Optimal Transport Aggregation for Visual Place Recognition |
|
36 |
2023-11-30 |
link |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning |
|
35 |
2023-11-28 |
link |
A Unified Approach for Text- and Image-Guided 4D Scene Generation |
|
35 |
2023-10-31 |
link |
CapsFusion: Rethinking Image-Text Data at Scale |
|
35 |
2023-11-28 |
link |
LEDITS++: Limitless Image Editing Using Text-to-Image Models |
|
34 |
2023-12-26 |
link |
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision |
|
34 |
2023-12-21 |
link |
Paint3D: Paint Anything 3D With Lighting-Less Texture Diffusion Models |
|
34 |
2023-12-08 |
link |
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation |
|
33 |
2023-09-20 |
link |
RMT: Retentive Networks Meet Vision Transformers |
|
33 |
2024-01-18 |
link |
OMG-Seg: Is One Model Good Enough for all Segmentation? |
|
32 |
2024-03-18 |
link |
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters |
|
32 |
2023-12-04 |
link |
PixelLM: Pixel Reasoning with Large Multimodal Model |
|
32 |
2024-04-08 |
link |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
|
32 |
2023-12-12 |
link |
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition |
|
32 |
2023-12-05 |
link |
Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? |
|
31 |
2023-12-02 |
link |
Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction |
|
31 |
2023-12-11 |
link |
SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models |
|
31 |
2023-11-27 |
link |
CoSeR: Bridging Image and Language for Cognitive Super-Resolution |
|
31 |
2023-06-01 |
link |
Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models |
|
31 |
2023-12-01 |
link |
VideoBooth: Diffusion-based Video Generation with Image Prompts |
|
30 |
2023-12-12 |
link |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model |
|
30 |
2023-11-27 |
link |
SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation |
|
30 |
2023-12-26 |
link |
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation |
|
30 |
2023-08-18 |
link |
Towards Large-Scale 3D Representation Learning with Multi-Dataset Point Prompt Training |
|
30 |
2023-11-27 |
link |
CG-HOI: Contact-Guided 3D Human-Object Interaction Generation |
|
29 |
2023-12-07 |
link |
Free3D: Consistent Novel View Synthesis Without 3D Representation |
|
29 |
2023-12-11 |
link |
Style Injection in Diffusion: A Training-Free Approach for Adapting Large-Scale Diffusion Models for Style Transfer |
|
29 |
2023-12-12 |
link |
WHAM: Reconstructing World-Grounded Humans with Accurate 3D Motion |
|
29 |
2023-11-26 |
link |
NeuRAD: Neural Rendering for Autonomous Driving |
|
29 |
2023-12-17 |
link |
Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance |
|
29 |
2023-11-30 |
link |
GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs |
|
29 |
2024-03-20 |
link |
Multi-Modal Hallucination Control by Visual Information Grounding |
|
29 |
2023-11-23 |
link |
SinSR: Diffusion-Based Image Super-Resolution in a Single Step |
|
29 |
2023-12-06 |
link |
Cache Me if You Can: Accelerating Diffusion Models through Block Caching |
|
28 |
2024-03-03 |
link |
3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos |
|
28 |
2023-11-29 |
link |
Gaussian Shell Maps for Efficient 3D Human Generation |
|
28 |
2024-02-08 |
link |
Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents |
|
28 |
2023-09-28 |
link |
CCEdit: Creative and Controllable Video Editing via Diffusion Models |
|
28 |
2023-11-18 |
link |
SNI-SLAM: Semantic Neural Implicit SLAM |
|
28 |
2023-12-18 |
link |
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering |
|
28 |
2023-10-12 |
link |
UniPAD: A Universal Pre-Training Paradigm for Autonomous Driving |
|
28 |
2023-11-24 |
link |
DemoFusion: Democratising High-Resolution Image Generation With No $$$ |
|
27 |
2024-01-31 |
link |
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations |
|
27 |
2023-05-24 |
link |
RoMa: Robust Dense Feature Matching |
|
27 |
2023-10-01 |
link |
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs |
|
27 |
2023-12-11 |
link |
PortraitBooth: A Versatile Portrait Model for Fast Identity-Preserved Personalization |
|
27 |
2024-02-27 |
link |
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners |
|
27 |
2023-11-29 |
link |
MMA-Diffusion: MultiModal Attack on Diffusion Models |
|
27 |
2023-12-14 |
link |
Mosaic-SDF for 3D Generative Models |
|
27 |
2024-03-10 |
link |
MACE: Mass Concept Erasure in Diffusion Models |
|
27 |
2023-11-28 |
link |
Panacea: Panoramic and Controllable Video Generation for Autonomous Driving |
|
26 |
2023-12-07 |
link |
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs |
|
26 |
2023-11-28 |
link |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers |
|
26 |
2023-11-30 |
link |
BioCLIP: A Vision Foundation Model for the Tree of Life |
|
26 |
2023-11-20 |
link |
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning |
|
26 |
2024-01-17 |
link |
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior |
|
26 |
2023-12-07 |
link |
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models |
|
26 |
2024-02-22 |
link |
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis |
|
25 |
2024-03-14 |
link |
Generalized Predictive Model for Autonomous Driving |
|
25 |
2023-12-01 |
link |
VMC: Video Motion Customization Using Temporal Attention Adaption for Text-to-Video Diffusion Models |
|
25 |
2024-03-29 |
link |
Rewrite the Stars |
|
25 |
2024-01-17 |
link |
GARField: Group Anything with Radiance Fields |
|
25 |
2024-03-11 |
link |
FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization |
|
25 |
2023-12-26 |
link |
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI |
|
25 |
2023-09-26 |
link |
Event Stream-Based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline |
|
25 |
2024-01-03 |
link |
Instruct-Imagen: Image Generation with Multi-modal Instruction |
|
25 |
2023-12-18 |
link |
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning |
|
25 |
2023-12-03 |
link |
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models |
|
24 |
2024-02-29 |
link |
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models |
|
24 |
2024-03-10 |
link |
Poly Kernel Inception Network for Remote Sensing Detection |
|
24 |
2023-12-14 |
link |
General Object Foundation Model for Images and Videos at Scale |
|
24 |
2023-09-01 |
link |
CityDreamer: Compositional Generative Model of Unbounded 3D Cities |
|
24 |
2023-12-14 |
link |
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions |
|
24 |
2023-04-13 |
link |
Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction |
|
24 |
2023-11-28 |
link |
Shadows Don't Lie and Lines Can't Bend! Generative Models Don't know Projective Geometry…for Now |
|
24 |
2024-02-27 |
link |
Neural Video Compression with Feature Modulation |
|
23 |
2023-12-06 |
link |
WonderJourney: Going from Anywhere to Everywhere |
|
23 |
2023-11-27 |
link |
Self-Correcting LLM-Controlled Diffusion Models |
|
23 |
2023-11-30 |
link |
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation |
|
23 |
2023-11-23 |
link |
GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence |
|
23 |
2024-03-04 |
link |
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models |
|
23 |
2023-11-28 |
link |
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following |
|
23 |
2023-11-28 |
link |
LLaFS: When Large Language Models Meet Few-Shot Segmentation |
|
23 |
2023-12-29 |
link |
Visual Point Cloud Forecasting Enables Scalable Autonomous Driving |
|
23 |
2023-12-06 |
link |
HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting |
|
22 |
2023-12-26 |
link |
One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications |
|
22 |
2023-09-04 |
link |
Can I Trust Your Answer? Visually Grounded Video Question Answering |
|
22 |
2023-12-19 |
link |
Optimizing Diffusion Noise Can Serve As Universal Motion Priors |
|
22 |
2024-04-05 |
link |
SpatialTracker: Tracking Any 2D Pixels in 3D Space |
|
22 |
2023-09-06 |
link |
Diffusion-EDFs: Bi-Equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation |
|
22 |
2024-02-04 |
link |
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-Based Image Editing |
|
22 |
2024-04-09 |
link |
3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis |
|
22 |
2023-12-17 |
link |
VidToMe: Video Token Merging for Zero-Shot Video Editing |
|
22 |
2023-11-19 |
link |
Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection |
|
22 |
2024-02-26 |
link |
Groundhog: Grounding Large Language Models to Holistic Segmentation |
|
22 |
2023-12-27 |
link |
Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection |
|
22 |
2024-03-12 |
link |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions |
|
22 |
2023-12-12 |
link |
EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection |
|
22 |
2024-02-29 |
link |
Towards Generalizable Tumor Synthesis |
|
22 |
2023-12-17 |
link |
SAI3D: Segment any Instance in 3D Scenes |
|
21 |
2024-03-04 |
link |
RegionGPT: Towards Region Understanding Vision Language Model |
|
21 |
2023-11-30 |
link |
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps |
|
21 |
2023-11-24 |
link |
OneFormer3D: One Transformer for Unified Point Cloud Segmentation |
|
21 |
2023-11-29 |
link |
SODA: Bottleneck Diffusion Models for Representation Learning |
|
21 |
2023-12-04 |
link |
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence |
|
21 |
2023-11-27 |
link |
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation |
|
21 |
2023-06-07 |
link |
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models |
|
21 |
2023-12-01 |
link |
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers |
|
21 |
2024-03-06 |
link |
Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation |
|
21 |
2023-11-26 |
link |
BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP |
|
21 |
2024-02-27 |
link |
Preserving Fairness Generalization in Deepfake Detection |
|
20 |
2024-03-11 |
link |
Toward Generalist Anomaly Detection via In-Context Residual Learning with Few-Shot Sample Prompts |
|
20 |
2024-03-03 |
link |
Logit Standardization in Knowledge Distillation |
|
20 |
2024-03-19 |
link |
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection |
|
20 |
2024-01-23 |
link |
The Neglected Tails in Vision-Language Models |
|
20 |
2024-03-06 |
link |
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing |
|
20 |
2023-12-13 |
link |
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models |
|
20 |
2023-12-07 |
link |
Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation |
|
20 |
2024-03-02 |
link |
TUMTraf V2X Cooperative Perception Dataset |
|
20 |
2024-02-14 |
link |
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM |
|
20 |
2023-12-07 |
link |
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation |
|
20 |
2023-11-28 |
link |
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer |
|
20 |
2023-12-15 |
link |
Rich Human Feedback for Text-to-Image Generation |
|
20 |
2024-03-19 |
link |
HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting |
|
20 |
2023-09-29 |
link |
Text-Image Alignment for Diffusion-Based Perception |
|
20 |
2023-12-01 |
link |
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion |
|
20 |
2023-11-29 |
link |
SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis |
|
20 |
2023-12-14 |
link |
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft |
|
20 |
2024-02-06 |
link |
EscherNet: A Generative Model for Scalable View Synthesis |
|
20 |
2023-11-27 |
link |
Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion |
|
19 |
2023-06-27 |
link |
Detector-Free Structure from Motion |
|
19 |
2023-12-06 |
link |
MMM: Generative Masked Motion Model |
|
19 |
2024-02-07 |
link |
SPAD: Spatially Aware Multi-View Diffusers |
|
19 |
2023-12-07 |
link |
GenTron: Diffusion Transformers for Image and Video Generation |
|
19 |
2023-12-05 |
link |
Orthogonal Adaptation for Modular Customization of Diffusion Models |
|
19 |
2023-09-12 |
link |
Language Models as Black-Box Optimizers for Vision-Language Models |
|
19 |
2024-01-03 |
link |
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations |
|
19 |
2023-12-08 |
link |
ControlRoom3D: Room Generation Using Semantic Proxy Rooms |
|
19 |
2023-12-29 |
link |
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis |
|
19 |
2024-03-18 |
link |
Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning |
|
19 |
2023-12-06 |
link |
On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm |
|
19 |
2024-02-15 |
link |
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization |
|
19 |
2024-02-08 |
link |
Driving Everywhere with Large Language Model Policy Adaptation |
|
19 |
2023-11-22 |
link |
Visual in-Context Prompting |
|
19 |
2023-11-28 |
link |
Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration |
|
19 |
2023-12-12 |
link |
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor |
|
19 |
2023-12-21 |
link |
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models |
|
19 |
2024-02-23 |
link |
State Space Models for Event Cameras |
|
19 |
2023-12-18 |
link |
SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution |
|
19 |
2023-11-28 |
link |
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors |
|
19 |
2023-11-28 |
link |
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training |
|
19 |
2024-04-04 |
link |
WorDepth: Variational Language Prior for Monocular Depth Estimation |
|
19 |
2023-12-12 |
link |
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception |
|
19 |
2023-12-06 |
link |
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks |
|
18 |
2024-01-10 |
link |
VLP: Vision Language Planning for Autonomous Driving |
|
18 |
2023-12-05 |
link |
GPT4Point: A Unified Framework for Point-Language Understanding and Generation |
|
18 |
2023-12-11 |
link |
DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior |
|
18 |
2024-03-13 |
link |
Scaling Up Dynamic Human-Scene Interaction Modeling |
|
18 |
2023-11-11 |
link |
PerceptionGPT: Effectively Fusing Visual Perception Into LLM |
|
18 |
2024-01-11 |
link |
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications |
|
18 |
2023-12-15 |
link |
GSVA: Generalized Segmentation via Multimodal Large Language Models |
|
18 |
2024-01-17 |
link |
Vlogger: Make Your Dream A Vlog |
|
18 |
2024-04-15 |
link |
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI |
|
18 |
2024-04-08 |
link |
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding |
|
18 |
2023-12-19 |
link |
InstructVideo: Instructing Video Diffusion Models with Human Feedback |
|
18 |
2023-12-11 |
link |
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion |
|
18 |
2023-12-11 |
link |
CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models |
|
18 |
2023-12-04 |
link |
COTR: Compact Occupancy TRansformer for Vision-Based 3D Occupancy Prediction |
|
18 |
2023-05-25 |
link |
Learning Occupancy for Monocular 3D Object Detection |
|
18 |
2023-11-20 |
link |
OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning |
|
17 |
2023-12-11 |
link |
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior |
|
17 |
2023-08-20 |
link |
Boosting Adversarial Transferability by Block Shuffle and Rotation |
|
17 |
2024-03-24 |
link |
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World |
|
17 |
2023-06-15 |
link |
Generative Proxemics: A Prior for 3D Social Interaction from Images |
|
17 |
2024-01-17 |
link |
TextureDreamer: Image-Guided Texture Synthesis through Geometry-Aware Diffusion |
|
17 |
2024-04-01 |
link |
Video Interpolation with Diffusion Models |
|
17 |
2023-12-05 |
link |
Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation |
|
17 |
2023-05-19 |
link |
Equivariant Multi-Modality Image Fusion |
|
17 |
2024-02-20 |
link |
Video ReCap: Recursive Captioning of Hour-Long Videos |
|
17 |
2023-12-02 |
link |
Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D |
|
17 |
2023-12-12 |
link |
A Simple Recipe for Contrastively Pre-Training Video-First Encoders Beyond 16 Frames |
|
17 |
2023-03-25 |
link |
UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes |
|
16 |
2024-01-31 |
link |
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error |
|
16 |
2023-12-21 |
link |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models |
|
16 |
2024-04-04 |
link |
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation |
|
16 |
2023-12-04 |
link |
Aligning and Prompting Everything All at Once for Universal Visual Perception |
|
16 |
2023-11-28 |
link |
Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence |
|
16 |
2024-01-04 |
link |
Learning the 3D Fauna of the Web |
|
16 |
2024-02-14 |
link |
Loopy-SLAM: Dense Neural SLAM with Loop Closures |
|
16 |
2023-11-20 |
link |
DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation |
|
16 |
2023-12-19 |
link |
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model |
|
16 |
2023-12-10 |
link |
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction |
|
16 |
2023-12-04 |
link |
Towards Learning a Generalist Model for Embodied Navigation |
|
16 |
2023-05-31 |
link |
Control4D: Efficient 4D Portrait Editing With Text |
|
16 |
2023-11-09 |
link |
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities |
|
16 |
2023-12-01 |
link |
Dense Optical Tracking: Connecting the Dots |
|
16 |
2023-12-18 |
link |
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing |
|
16 |
2023-12-07 |
link |
Open-Vocabulary Segmentation with Semantic-Assisted Calibration |
|
16 |
2023-12-21 |
link |
ZeroShape: Regression-Based Zero-Shot Shape Reconstruction |
|
16 |
2024-03-01 |
link |
Rethinking Inductive Biases for Surface Normal Estimation |
|
16 |
2024-03-18 |
link |
MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception |
|
16 |
2023-12-05 |
link |
Describing Differences in Image Sets with Natural Language |
|
16 |
2023-11-28 |
link |
Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld |
|
15 |
2023-12-28 |
link |
ZONE: Zero-Shot Instruction-Guided Local Editing |
|
15 |
2024-03-26 |
link |
Move as you Say, Interact as you can: Language-Guided Human Motion Generation with Scene Affordance |
|
15 |
2023-12-18 |
link |
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update |
|
15 |
2023-12-06 |
link |
AVID: Any-Length Video Inpainting with Diffusion Model |
|
15 |
2024-04-15 |
link |
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction |
|
15 |
2023-11-27 |
link |
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models |
|
15 |
2023-12-05 |
link |
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models |
|
15 |
2024-03-08 |
link |
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery |
|
15 |
2023-12-12 |
link |
Peekaboo: Interactive Video Generation via Masked-Diffusion |
|
15 |
2024-04-10 |
link |
HRVDA: High-Resolution Visual Document Assistant |
|
15 |
2024-01-03 |
link |
A Vision Check-up for Language Models |
|
15 |
2024-04-30 |
link |
XFeat: Accelerated Features for Lightweight Image Matching |
|
15 |
2024-01-25 |
link |
pix2gestalt: Amodal Segmentation by Synthesizing Wholes |
|
15 |
2023-11-30 |
link |
MotionEditor: Editing Video Motion via Content-Aware Diffusion |
|
15 |
2023-12-26 |
link |
HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D |
|
15 |
2023-12-05 |
link |
FINER: Flexible Spectral-Bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions |
|
15 |
2024-05-07 |
link |
DriveWorld: 4D Pre-Trained Scene Understanding via World Models for Autonomous Driving |
|
15 |
2023-01-26 |
link |
Discovering and Mitigating Visual Biases Through Keyword Explanation |
|
15 |
2024-04-07 |
link |
Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models |
|
15 |
2023-11-27 |
link |
SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion |
|
15 |
2023-11-26 |
link |
Insect-Foundation: A Foundation Model and Large-Scale 1M Dataset for Visual Insect Understanding |
|
15 |
2023-11-27 |
link |
Single-Model and Any-Modality for Video Object Tracking |
|
15 |
2024-03-04 |
link |
HanDiffuser: Text-to-Image Generation with Realistic Hand Appearances |
|
15 |
2023-12-19 |
link |
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation |
|
15 |
2024-03-14 |
link |
OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning |
|
15 |
2023-12-27 |
link |
SVGDreamer: Text Guided SVG Generation with Diffusion Model |
|
15 |
2023-05-19 |
link |
DAP: A Dynamic Adversarial Patch for Evading Person Detectors |
|
14 |
2023-10-03 |
link |
Sieve: Multimodal Dataset Pruning Using Image Captioning Models |
|
14 |
2024-03-09 |
link |
Robust Emotion Recognition in Context Debiasing |
|
14 |
2024-01-09 |
link |
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-Driven Holistic 3D Expression and Gesture Generation |
|
14 |
2024-02-19 |
link |
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships |
|
14 |
2023-12-05 |
link |
Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration |
|
14 |
2023-12-20 |
link |
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis |
|
14 |
2023-12-07 |
link |
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models |
|
14 |
2023-07-14 |
link |
SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments |
|
14 |
2023-11-27 |
link |
Text2Loc: 3D Point Cloud Localization from Natural Language |
|
14 |
2024-04-06 |
link |
InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization |
|
14 |
2023-05-27 |
link |
Zero-TPrune: Zero-Shot Token Pruning Through Leveraging of the Attention Graph in Pre-Trained Transformers |
|
14 |
2023-12-07 |
link |
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos |
|
14 |
2024-03-25 |
link |
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models |
|
14 |
2024-03-15 |
link |
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation |
|
14 |
2023-08-22 |
link |
MatFuse: Controllable Material Generation with Diffusion Models |
|
14 |
2024-03-11 |
link |
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations |
|
14 |
2024-01-01 |
link |
Retrieval-Augmented Egocentric Video Captioning |
|
14 |
2023-12-07 |
link |
NeRFiller: Completing Scenes via Generative 3D Inpainting |
|
14 |
2024-06-16 |
link |
SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes |
|
14 |
2023-12-05 |
link |
C3: High-Performance and Low-Complexity Neural Compression from a Single Image or Video |
|
14 |
2024-03-30 |
link |
InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning |
|
14 |
2024-01-29 |
link |
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design |
|
14 |
2023-12-03 |
link |
FlashAvatar: High-Fidelity Head Avatar with Efficient Gaussian Embedding |
|
14 |
2023-12-09 |
link |
CoGS: Controllable Gaussian Splatting |
|
14 |
2024-04-01 |
link |
Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation |
|
14 |
2023-12-28 |
link |
Improving Image Restoration Through Removing Degradations in Textual Representations |
|
14 |
2023-12-26 |
link |
Inter-X: Towards Versatile Human-Human Interaction Analysis |
|
14 |
2023-11-29 |
link |
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning |
|
14 |
2024-05-29 |
link |
NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild |
|
13 |
2023-10-10 |
link |
MuseChat: A Conversational Music Recommendation System for Videos |
|
13 |
2024-02-23 |
link |
Seamless Human Motion Composition with Blended Positional Encodings |
|
13 |
2023-12-25 |
link |
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos |
|
13 |
2024-02-27 |
link |
VRP-SAM: SAM with Visual Reference Prompt |
|
13 |
2024-03-15 |
link |
RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception |
|
13 |
2023-11-18 |
link |
Structure-Aware Sparse-View X-Ray 3D Reconstruction |
|
13 |
2023-03-24 |
link |
DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis |
|
13 |
2023-11-30 |
link |
ElasticDiffusion: Training-Free Arbitrary Size Image Generation Through Global-Local Content Separation |
|
13 |
2024-03-05 |
link |
Sniffer: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection |
|
13 |
2023-12-31 |
link |
Taming Mode Collapse in Score Distillation for Text-to-3D Generation |
|
13 |
2024-03-29 |
link |
FairCLIP: Harnessing Fairness in Vision-Language Learning |
|
13 |
2024-05-07 |
link |
Tactile-Augmented Radiance Fields |
|
13 |
2023-11-30 |
link |
HOLD: Category-Agnostic 3D Reconstruction of Interacting Hands and Objects from Video |
|
13 |
2024-03-20 |
link |
DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception |
|
13 |
2023-11-26 |
link |
Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding |
|
13 |
2023-11-21 |
link |
Breathing Life Into Sketches Using Text-to-Video Priors |
|
13 |
2024-03-27 |
link |
ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation |
|
13 |
2023-12-07 |
link |
Text-to-3D Generation with Bidirectional Diffusion Using Both 2D and 3D Priors |
|
13 |
2023-11-30 |
link |
ChatPose: Chatting about 3D Human Pose |
|
13 |
2024-03-19 |
link |
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation |
|
13 |
2024-03-21 |
link |
CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-Spoofing |
|
13 |
2023-11-22 |
link |
PIE-NeRF: Physics-Based Interactive Elastodynamics with NeRF |
|
13 |
2024-01-02 |
link |
Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation |
|
13 |
2023-12-07 |
link |
Digital Life Project: Autonomous 3D Characters with Social Intelligence |
|
13 |
2023-12-10 |
link |
AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One |
|
13 |
2023-06-15 |
link |
Seeing the World through Your Eyes |
|
13 |
2024-05-21 |
link |
Nearest is Not Dearest: Towards Practical Defense Against Quantization-Conditioned Backdoor Attacks |
|
13 |
2024-04-06 |
link |
Diffusion Time-step Curriculum for One Image to 3D Generation |
|
13 |
2023-12-07 |
link |
MuRF: Multi-Baseline Radiance Fields |
|
13 |
2023-12-11 |
link |
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution |
|
13 |
2023-03-23 |
link |
NOPE: Novel Object Pose Estimation from a Single Image |
|
13 |
2023-12-05 |
link |
HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces |
|
12 |
2023-08-19 |
link |
Noisy-Correspondence Learning for Text-to-Image Person Re-Identification |
|
12 |
2023-11-28 |
link |
AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond |
|
12 |
2024-03-01 |
link |
Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching |
|
12 |
2024-04-08 |
link |
PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection |
|
12 |
2024-05-02 |
link |
Multi-Space Alignments Towards Universal LiDAR Segmentation |
|
12 |
2023-01-22 |
link |
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation |
|
12 |
2023-12-07 |
link |
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations |
|
12 |
2024-03-05 |
link |
Interactive Continual Learning: Fast and Slow Thinking |
|
12 |
2023-12-28 |
link |
Amodal Ground Truth and Completion in the Wild |
|
12 |
2023-11-27 |
link |
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension |
|
12 |
2024-03-25 |
link |
VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation |
|
12 |
2022-12-06 |
link |
Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis |
|
12 |
2023-11-29 |
link |
One-Shot Open Affordance Learning with Foundation Models |
|
12 |
2023-12-04 |
link |
Readout Guidance: Learning Control from Diffusion Features |
|
12 |
2023-12-04 |
link |
PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation |
|
12 |
2024-03-18 |
link |
HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data |
|
12 |
2024-04-01 |
link |
Towards Memorization-Free Diffusion Models |
|
12 |
2023-04-05 |
link |
VicTR: Video-conditioned Text Representations for Activity Recognition |
|
12 |
2024-03-09 |
link |
RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection |
|
12 |
2023-11-30 |
link |
DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars |
|
12 |
2023-11-13 |
link |
Open-Vocabulary Video Anomaly Detection |
|
12 |
2023-12-11 |
link |
DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting |
|
12 |
2024-04-03 |
link |
On the Scalability of Diffusion-based Text-to-Image Generation |
|
12 |
2024-02-08 |
link |
Question Aware Vision Transformer for Multimodal Reasoning |
|
12 |
2024-03-12 |
link |
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension |
|
12 |
2023-08-25 |
link |
Residual Denoising Diffusion Models |
|
12 |
2024-03-05 |
link |
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models |
|
12 |
2023-11-30 |
link |
On Exact Inversion of DPM-Solvers |
|
12 |
2024-03-07 |
link |
Depth-Aware Test-Time Training for Zero-Shot Video Object Segmentation |
|
12 |
2024-03-17 |
link |
Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model |
|
12 |
2024-03-07 |
link |
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed |
|
12 |
2024-01-09 |
link |
Pre-Trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness |
|
12 |
2023-11-28 |
link |
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation |
|
12 |
2024-01-08 |
link |
MS-DETR: Efficient DETR Training with Mixed Supervision |
|
11 |
2023-12-07 |
link |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs |
|
11 |
2024-03-05 |
link |
Multi-Modal Instruction Tuned LLMs with Fine-Grained Visual Perception |
|
11 |
2024-03-26 |
link |
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models |
|
11 |
2023-11-28 |
link |
Text-Driven Image Editing via Learnable Regions |
|
11 |
2024-02-22 |
link |
CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation |
|
11 |
2024-04-29 |
link |
EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars |
|
11 |
2024-01-16 |
link |
TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding |
|
11 |
2023-11-25 |
link |
Point Cloud Pre-Training with Diffusion Models |
|
11 |
2024-03-28 |
link |
Mitigating Motion Blur in Neural Radiance Fields with Events and Frames |
|
11 |
2023-12-19 |
link |
Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models |
|
11 |
2024-03-12 |
link |
PeLK: Parameter-Efficient Large Kernel ConvNets with Peripheral Convolution |
|
11 |
2023-11-29 |
link |
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models |
|
11 |
2023-12-20 |
link |
SpecNeRF: Gaussian Directional Encoding for Specular Reflections |
|
11 |
2024-01-15 |
link |
MaskClustering: View Consensus Based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation |
|
11 |
2023-09-14 |
link |
DePT: Decoupled Prompt Tuning |
|
11 |
2024-04-09 |
link |
HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention |
|
11 |
2024-06-06 |
link |
Matching Anything by Segmenting Anything |
|
11 |
2023-11-27 |
link |
ViT-Lens: Towards Omni-modal Representations |
|
11 |
2024-03-27 |
link |
Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding |
|
11 |
2024-01-18 |
link |
Towards Language-Driven Video Inpainting via Multimodal Large Language Models |
|
11 |
2023-11-23 |
link |
PointOBB: Learning Oriented Object Detection via Single Point Supervision |
|
11 |
2023-12-15 |
link |
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation |
|
11 |
2023-04-02 |
link |
From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding |
|
11 |
2024-04-05 |
link |
Koala: Key Frame-Conditioned Long Video-LLM |
|
11 |
2024-01-04 |
link |
Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions |
|
11 |
2024-06-16 |
link |
VideoLLM-online: Online Video Large Language Model for Streaming Video |
|
11 |
2023-11-30 |
link |
Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion? |
|
11 |
2023-11-17 |
link |
High-fidelity Person-centric Subject-to-Image Synthesis |
|
11 |
2024-04-11 |
link |
GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh |
|
11 |
2024-03-04 |
link |
Neural Redshift: Random Networks are not Random Functions |
|
11 |
2024-03-17 |
link |
A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation |
|
11 |
2024-04-22 |
link |
AutoAD III: The Prequel - Back to the Pixels |
|
11 |
2024-02-29 |
link |
SeD: Semantic-Aware Discriminator for Image Super-Resolution |
|
11 |
2023-11-29 |
link |
Continual Self-Supervised Learning: Towards Universal Multi-Modal Medical Data Representation Learning |
|
11 |
2024-02-13 |
link |
PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models |
|
11 |
2024-02-12 |
link |
Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles |
|
11 |
2023-10-26 |
link |
SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching |
|
11 |
2024-01-16 |
link |
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World |
|
11 |
2024-01-16 |
link |
SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation |
|
11 |
2023-11-18 |
link |
Implicit Event-RGBD Neural SLAM |
|
11 |
2023-08-15 |
link |
Link-Context Learning for Multimodal LLMs |
|
11 |
2024-02-28 |
link |
TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding |
|
11 |
2023-12-31 |
link |
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling |
|
11 |
2023-12-13 |
link |
See, Say, and Segment: Teaching LMMs to Overcome False Premises |
|
11 |
2024-01-04 |
link |
BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model |
|
11 |
2023-11-23 |
link |
Posterior Distillation Sampling |
|
11 |
2024-03-21 |
link |
EventDance: Unsupervised Source-Free Cross-Modal Adaptation for Event-Based Object Recognition |
|
11 |
2023-12-12 |
link |
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing |
|
11 |
2024-02-27 |
link |
VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis |
|
10 |
2024-03-24 |
link |
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement |
|
10 |
2023-04-24 |
link |
End-to-End Spatio-Temporal Action Localisation with Video Transformers |
|
10 |
2023-11-03 |
link |
HIPTrack: Visual Tracking with Historical Prompts |
|
10 |
2023-12-05 |
link |
Alchemist: Parametric Control of Material Properties with Diffusion Models |
|
10 |
2023-12-14 |
link |
Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences |
|
10 |
2024-03-25 |
link |
RCBEVDet: Radar-Camera Fusion in Bird's Eye View for 3D Object Detection |
|
10 |
2023-11-28 |
link |
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames |
|
10 |
2023-12-19 |
link |
Mask Grounding for Referring Image Segmentation |
|
10 |
2023-08-15 |
link |
Relightable and Animatable Neural Avatar from Sparse-View Video |
|
10 |
2023-12-11 |
link |
Localization is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix it |
|
10 |
2023-12-11 |
link |
MonoNPHM: Dynamic Head Reconstruction from Monocular Videos |
|
10 |
2024-05-03 |
link |
On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do we Really need Prompt Learning? |
|
10 |
2023-12-04 |
link |
ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation |
|
10 |
2024-04-01 |
link |
MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction |
|
10 |
2024-05-15 |
link |
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge |
|
10 |
2023-05-24 |
link |
InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360° Neural Radiance Fields |
|
10 |
2023-12-06 |
link |
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity |
|
10 |
2024-03-22 |
link |
IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection |
|
10 |
2024-06-05 |
link |
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection |
|
10 |
2023-12-01 |
link |
Segment and Caption Anything |
|
10 |
2023-06-20 |
link |
CrossKD: Cross-Head Knowledge Distillation for Object Detection |
|
10 |
2023-11-07 |
link |
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features |
|
10 |
2023-12-20 |
link |
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models |
|
10 |
2023-12-28 |
link |
Unsupervised Universal Image Segmentation |
|
10 |
2024-02-29 |
link |
PEM: Prototype-Based Efficient MaskFormer for Image Segmentation |
|
10 |
2023-10-23 |
link |
MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion |
|
10 |
2024-04-10 |
link |
UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion |
|
10 |
2024-01-16 |
link |
Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary |
|
10 |
2024-05-19 |
link |
Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology |
|
10 |
2023-12-20 |
link |
Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps |
|
10 |
2023-10-27 |
link |
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image |
|
10 |
2023-12-03 |
link |
Language-driven All-in-one Adverse Weather Removal |
|
10 |
2023-11-23 |
link |
Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-End Oriented Object Detection with Single Point Supervision |
|
10 |
2023-12-04 |
link |
PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness |
|
10 |
2024-02-29 |
link |
CricaVPR: Cross-Image Correlation-Aware Representation Learning for Visual Place Recognition |
|
10 |
2023-11-28 |
link |
As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors |
|