Last updated: 2024-12-09 08:48:03. Maintained by Weisen Jiang.

citation date review title (pdf) authors
1580 2023-10-05 link Improved Baselines with Visual Instruction Tuning
423 2023-11-27 link MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
360 2023-04-17 link DETRs Beat YOLOs on Real-time Object Detection
351 2024-01-19 link Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
346 2023-10-12 link 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
264 2023-11-07 link mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
264 2023-10-23 link Wonder3D: Single Image to 3D Using Cross-Domain Diffusion
243 2023-08-01 link LISA: Reasoning Segmentation via Large Language Model
222 2023-09-22 link Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
204 2023-04-06 link InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
198 2023-11-21 link SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering
182 2023-11-28 link Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
181 2023-12-12 link VILA: On Pre-training for Visual Language Models
173 2023-11-28 link MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
170 2023-11-27 link Mip-Splatting: Alias-Free 3D Gaussian Splatting
167 2023-07-18 link AnyDoor: Zero-shot Object-level Image Customization
166 2024-01-11 link Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
160 2023-11-11 link Monkey: Image Resolution and Text Label are Important Things for Large Multi-Modal Models
158 2023-09-28 link Text-to-3D using Gaussian Splatting
158 2023-12-04 link SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
157 2023-12-20 link Generative Multimodal Models are In-Context Learners
147 2023-03-08 link Video-P2P: Video Editing with Cross-Attention Control
141 2023-12-11 link Gaussian Splatting SLAM
135 2023-07-31 link MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
134 2024-01-17 link VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
133 2023-11-29 link VBench: Comprehensive Benchmark Suite for Video Generative Models
132 2023-11-30 link Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
132 2023-07-13 link HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
125 2023-06-26 link DragDiffusion: Harnessing Diffusion Models for Interactive Point-Based Image Editing
122 2024-01-30 link YOLO-World: Real-Time Open-Vocabulary Object Detection
121 2023-11-14 link One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
120 2023-12-19 link PixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
119 2023-12-21 link DUSt3R: Geometric 3D Vision Made Easy
119 2023-11-20 link GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting
116 2023-11-24 link GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
115 2023-11-30 link One-Step Diffusion with Distribution Matching Distillation
114 2023-11-28 link Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
113 2023-12-14 link Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
112 2023-11-27 link MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
110 2023-04-03 link DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
108 2023-11-21 link Diffusion Model Alignment Using Direct Preference Optimization
108 2023-12-01 link RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback
108 2023-11-19 link LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
105 2023-11-06 link GLaMM: Pixel Grounding Large Multimodal Model
105 2023-12-01 link Sequential Modeling Enables Scalable Learning for Large Vision Models
103 2023-11-29 link OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
100 2023-11-14 link Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
99 2023-11-22 link Compact 3D Gaussian Representation for Radiance Field
98 2023-11-20 link PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics
97 2023-12-07 link PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
97 2023-12-20 link Splatter Image: Ultra-Fast Single-View 3D Reconstruction
95 2023-12-05 link ReconFusion: 3D Reconstruction with Diffusion Priors
90 2023-12-28 link Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
89 2023-12-13 link DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
88 2023-12-01 link EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
87 2023-04-12 link An Edit Friendly DDPM Noise Space: Inversion and Manipulations
86 2023-12-04 link TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
85 2023-12-26 link LangSplat: 3D Language Gaussian Splatting
85 2023-10-23 link HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
83 2024-01-22 link SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
83 2023-05-14 link ULIP-2: Towards Scalable Multimodal Pre-Training for 3D Understanding
82 2023-11-29 link GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces
82 2023-12-04 link SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
82 2023-03-30 link InceptionNeXt: When Inception Meets ConvNeXt
81 2023-12-06 link Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
79 2024-02-29 link Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
79 2023-09-20 link FreeU: Free Lunch in Diffusion U-Net
78 2023-12-21 link Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models
76 2023-12-05 link Analyzing and Improving the Training Dynamics of Diffusion Models
75 2023-06-08 link Grounded Text-to-Image Synthesis with Attention Refocusing
75 2023-12-13 link FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
74 2023-03-16 link HIVE: Harnessing Human Feedback for Instructional Visual Editing
74 2023-12-04 link Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
74 2023-12-04 link GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
73 2023-11-28 link RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
72 2023-10-17 link EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
72 2023-11-16 link Emu Edit: Precise Image Editing via Recognition and Generation Tasks
68 2023-12-01 link DeepCache: Accelerating Diffusion Models for Free
68 2023-12-11 link Honeybee: Locality-Enhanced Projector for Multimodal LLM
68 2023-12-28 link Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
67 2023-11-14 link UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
66 2024-02-27 link VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction
65 2023-12-04 link Style Aligned Image Generation via Shared Attention
65 2023-11-27 link GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
64 2023-12-21 link V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
64 2023-11-26 link GS-IR: 3D Gaussian Splatting for Inverse Rendering
63 2023-11-29 link 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
62 2023-10-12 link GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
62 2023-08-15 link CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
62 2023-03-21 link CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
61 2023-09-07 link InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
61 2023-12-12 link LMDrive: Closed-Loop End-to-End Driving with Large Language Models
60 2023-12-12 link COLMAP-Free 3D Gaussian Splatting
60 2023-11-29 link Driving Into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
60 2024-01-24 link Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
60 2023-12-15 link SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
59 2023-11-27 link SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
58 2024-01-08 link GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
58 2023-11-28 link Photo-SLAM: Real-Time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras
57 2023-08-18 link SimDA: Simple Diffusion Adapter for Efficient Video Generation
56 2023-11-17 link Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis
56 2023-11-30 link Rethinking FID: Towards a Better Evaluation Metric for Image Generation
56 2023-12-06 link OneLLM: One Framework to Align All Modalities with Language
56 2023-12-05 link GauHuman: Articulated Gaussian Splatting from Monocular Human Videos
55 2023-12-04 link GPS-Gaussian: Generalizable Pixel-Wise 3D Gaussian Splatting for Real-Time Human Novel View Synthesis
55 2023-10-19 link Putting the Object Back into Video Object Segmentation
55 2023-12-04 link GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
55 2023-11-18 link Make Pixels Dance: High-Dynamic Video Generation
54 2023-11-28 link HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting
54 2023-11-29 link HUGS: Human Gaussian Splats
54 2023-12-14 link 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
53 2023-11-10 link Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
53 2023-12-01 link ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
53 2023-11-28 link SEED-Bench-2: Benchmarking Multimodal Large Language Models
52 2023-11-27 link UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
52 2023-11-30 link VTimeLLM: Empower LLM to Grasp Video Moments
51 2023-12-06 link Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle
50 2023-11-27 link MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
50 2023-11-28 link Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering
50 2023-06-16 link PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
49 2024-06-16 link OpenEQA: Embodied Question Answering in the Era of Foundation Models
49 2024-03-27 link UniDepth: Universal Monocular Metric Depth Estimation
48 2023-11-30 link Diffusion Models Without Attention
48 2024-03-11 link DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
47 2023-11-29 link MoMask: Generative Masked Modeling of 3D Human Motions
47 2023-12-24 link ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
46 2024-04-12 link Probing the 3D Awareness of Visual Foundation Models
46 2023-10-02 link HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
45 2023-12-07 link Scaling Laws of Synthetic Images for Model Training … for Now
45 2023-12-15 link Osprey: Pixel Understanding with Visual Instruction Tuning
44 2023-12-15 link PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment
44 2023-11-22 link Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
43 2023-12-06 link Alpha-CLIP: A CLIP Model Focusing on Wherever you Want
43 2023-05-25 link Prompt-Free Diffusion: Taking “Text” Out of Text-to-Image Diffusion Models
43 2023-06-20 link Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation
43 2023-11-27 link Compositional Chain-of-Thought Prompting for Large Multimodal Models
42 2023-04-03 link RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding
42 2023-08-23 link Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
42 2023-11-30 link Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
42 2023-09-06 link Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields
41 2023-11-27 link GART: Gaussian Articulated Template Models
41 None link EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior
40 2023-11-28 link Human Gaussian Splatting: Real-Time Rendering of Animatable Avatars
40 2023-10-17 link 4K4D: Real-Time 4D View Synthesis at 4K Resolution
40 2023-09-14 link Generative Image Dynamics
39 2024-03-08 link SplattingAvatar: Realistic Real-Time Human Avatars With Mesh-Embedded Gaussian Splatting
39 2023-11-21 link SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
38 2023-10-16 link LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
38 2023-12-06 link Relightable Gaussian Codec Avatars
38 2023-07-14 link NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
38 2023-06-30 link DisCo: Disentangled Control for Realistic Human Dance Generation
38 2023-12-08 link Reconstructing Hands in 3D with Transformers
37 2024-02-08 link MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
36 2023-11-22 link HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
36 2023-12-10 link ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
36 2023-12-14 link Holodeck: Language Guided Generation of 3D Embodied AI Environments
36 2024-02-05 link InstanceDiffusion: Instance-Level Control for Image Generation
36 2023-06-27 link Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
36 2023-11-27 link Optimal Transport Aggregation for Visual Place Recognition
36 2023-11-30 link LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
35 2023-11-28 link A Unified Approach for Text- and Image-Guided 4D Scene Generation
35 2023-10-31 link CapsFusion: Rethinking Image-Text Data at Scale
35 2023-11-28 link LEDITS++: Limitless Image Editing Using Text-to-Image Models
34 2023-12-26 link DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
34 2023-12-21 link Paint3D: Paint Anything 3D With Lighting-Less Texture Diffusion Models
34 2023-12-08 link SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
33 2023-09-20 link RMT: Retentive Networks Meet Vision Transformers
33 2024-01-18 link OMG-Seg: Is One Model Good Enough for all Segmentation?
32 2024-03-18 link Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
32 2023-12-04 link PixelLM: Pixel Reasoning with Large Multimodal Model
32 2024-04-08 link MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
32 2023-12-12 link FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
32 2023-12-05 link Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
31 2023-12-02 link Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction
31 2023-12-11 link SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models
31 2023-11-27 link CoSeR: Bridging Image and Language for Cognitive Super-Resolution
31 2023-06-01 link Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
31 2023-12-01 link VideoBooth: Diffusion-based Video Generation with Image Prompts
30 2023-12-12 link Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
30 2023-11-27 link SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
30 2023-12-26 link SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
30 2023-08-18 link Towards Large-Scale 3D Representation Learning with Multi-Dataset Point Prompt Training
30 2023-11-27 link CG-HOI: Contact-Guided 3D Human-Object Interaction Generation
29 2023-12-07 link Free3D: Consistent Novel View Synthesis Without 3D Representation
29 2023-12-11 link Style Injection in Diffusion: A Training-Free Approach for Adapting Large-Scale Diffusion Models for Style Transfer
29 2023-12-12 link WHAM: Reconstructing World-Grounded Humans with Accurate 3D Motion
29 2023-11-26 link NeuRAD: Neural Rendering for Autonomous Driving
29 2023-12-17 link Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance
29 2023-11-30 link GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
29 2024-03-20 link Multi-Modal Hallucination Control by Visual Information Grounding
29 2023-11-23 link SinSR: Diffusion-Based Image Super-Resolution in a Single Step
29 2023-12-06 link Cache Me if You Can: Accelerating Diffusion Models through Block Caching
28 2024-03-03 link 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos
28 2023-11-29 link Gaussian Shell Maps for Efficient 3D Human Generation
28 2024-02-08 link Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents
28 2023-09-28 link CCEdit: Creative and Controllable Video Editing via Diffusion Models
28 2023-11-18 link SNI-SLAM: Semantic Neural Implicit SLAM
28 2023-12-18 link Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
28 2023-10-12 link UniPAD: A Universal Pre-Training Paradigm for Autonomous Driving
28 2023-11-24 link DemoFusion: Democratising High-Resolution Image Generation With No $$$
27 2024-01-31 link Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
27 2023-05-24 link RoMa: Robust Dense Feature Matching
27 2023-10-01 link Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
27 2023-12-11 link PortraitBooth: A Versatile Portrait Model for Fast Identity-Preserved Personalization
27 2024-02-27 link Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
27 2023-11-29 link MMA-Diffusion: MultiModal Attack on Diffusion Models
27 2023-12-14 link Mosaic-SDF for 3D Generative Models
27 2024-03-10 link MACE: Mass Concept Erasure in Diffusion Models
27 2023-11-28 link Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
26 2023-12-07 link LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
26 2023-11-28 link TransNeXt: Robust Foveal Visual Perception for Vision Transformers
26 2023-11-30 link BioCLIP: A Vision Foundation Model for the Tree of Life
26 2023-11-20 link BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
26 2024-01-17 link Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
26 2023-12-07 link RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
26 2024-02-22 link Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
25 2024-03-14 link Generalized Predictive Model for Autonomous Driving
25 2023-12-01 link VMC: Video Motion Customization Using Temporal Attention Adaption for Text-to-Video Diffusion Models
25 2024-03-29 link Rewrite the Stars
25 2024-01-17 link GARField: Group Anything with Radiance Fields
25 2024-03-11 link FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization
25 2023-12-26 link EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
25 2023-09-26 link Event Stream-Based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
25 2024-01-03 link Instruct-Imagen: Image Generation with Multi-modal Instruction
25 2023-12-18 link GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
25 2023-12-03 link ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models
24 2024-02-29 link DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
24 2024-03-10 link Poly Kernel Inception Network for Remote Sensing Detection
24 2023-12-14 link General Object Foundation Model for Images and Videos at Scale
24 2023-09-01 link CityDreamer: Compositional Generative Model of Unbounded 3D Cities
24 2023-12-14 link A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
24 2023-04-13 link Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction
24 2023-11-28 link Shadows Don't Lie and Lines Can't Bend! Generative Models Don't know Projective Geometry…for Now
24 2024-02-27 link Neural Video Compression with Feature Modulation
23 2023-12-06 link WonderJourney: Going from Anywhere to Everywhere
23 2023-11-27 link Self-Correcting LLM-Controlled Diffusion Models
23 2023-11-30 link CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
23 2023-11-23 link GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence
23 2024-03-04 link ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
23 2023-11-28 link Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
23 2023-11-28 link LLaFS: When Large Language Models Meet Few-Shot Segmentation
23 2023-12-29 link Visual Point Cloud Forecasting Enables Scalable Autonomous Driving
23 2023-12-06 link HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
22 2023-12-26 link One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications
22 2023-09-04 link Can I Trust Your Answer? Visually Grounded Video Question Answering
22 2023-12-19 link Optimizing Diffusion Noise Can Serve As Universal Motion Priors
22 2024-04-05 link SpatialTracker: Tracking Any 2D Pixels in 3D Space
22 2023-09-06 link Diffusion-EDFs: Bi-Equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
22 2024-02-04 link DiffEditor: Boosting Accuracy and Flexibility on Diffusion-Based Image Editing
22 2024-04-09 link 3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis
22 2023-12-17 link VidToMe: Video Token Merging for Zero-Shot Video Editing
22 2023-11-19 link Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
22 2024-02-26 link Groundhog: Grounding Large Language Models to Holistic Segmentation
22 2023-12-27 link Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection
22 2024-03-12 link ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
22 2023-12-12 link EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection
22 2024-02-29 link Towards Generalizable Tumor Synthesis
22 2023-12-17 link SAI3D: Segment any Instance in 3D Scenes
21 2024-03-04 link RegionGPT: Towards Region Understanding Vision Language Model
21 2023-11-30 link Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
21 2023-11-24 link OneFormer3D: One Transformer for Unified Point Cloud Segmentation
21 2023-11-29 link SODA: Bottleneck Diffusion Models for Representation Learning
21 2023-12-04 link VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
21 2023-11-27 link SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
21 2023-06-07 link WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
21 2023-12-01 link Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
21 2024-03-06 link Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
21 2023-11-26 link BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
21 2024-02-27 link Preserving Fairness Generalization in Deepfake Detection
20 2024-03-11 link Toward Generalist Anomaly Detection via In-Context Residual Learning with Few-Shot Sample Prompts
20 2024-03-03 link Logit Standardization in Knowledge Distillation
20 2024-03-19 link Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection
20 2024-01-23 link The Neglected Tails in Vision-Language Models
20 2024-03-06 link Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
20 2023-12-13 link FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
20 2023-12-07 link Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
20 2024-03-02 link TUMTraf V2X Cooperative Perception Dataset
20 2024-02-14 link OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
20 2023-12-07 link Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
20 2023-11-28 link Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
20 2023-12-15 link Rich Human Feedback for Text-to-Image Generation
20 2024-03-19 link HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
20 2023-09-29 link Text-Image Alignment for Diffusion-Based Perception
20 2023-12-01 link Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion
20 2023-11-29 link SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
20 2023-12-14 link Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
20 2024-02-06 link EscherNet: A Generative Model for Scalable View Synthesis
20 2023-11-27 link Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion
19 2023-06-27 link Detector-Free Structure from Motion
19 2023-12-06 link MMM: Generative Masked Motion Model
19 2024-02-07 link SPAD: Spatially Aware Multi-View Diffusers
19 2023-12-07 link GenTron: Diffusion Transformers for Image and Video Generation
19 2023-12-05 link Orthogonal Adaptation for Modular Customization of Diffusion Models
19 2023-09-12 link Language Models as Black-Box Optimizers for Vision-Language Models
19 2024-01-03 link From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
19 2023-12-08 link ControlRoom3D: Room Generation Using Semantic Proxy Rooms
19 2023-12-29 link FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
19 2024-03-18 link Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning
19 2023-12-06 link On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm
19 2024-02-15 link DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
19 2024-02-08 link Driving Everywhere with Large Language Model Policy Adaptation
19 2023-11-22 link Visual in-Context Prompting
19 2023-11-28 link Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration
19 2023-12-12 link CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
19 2023-12-21 link PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
19 2024-02-23 link State Space Models for Event Cameras
19 2023-12-18 link SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
19 2023-11-28 link SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors
19 2023-11-28 link MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
19 2024-04-04 link WorDepth: Variational Language Prior for Monocular Depth Estimation
19 2023-12-12 link MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
19 2023-12-06 link On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
18 2024-01-10 link VLP: Vision Language Planning for Autonomous Driving
18 2023-12-05 link GPT4Point: A Unified Framework for Point-Language Understanding and Generation
18 2023-12-11 link DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
18 2024-03-13 link Scaling Up Dynamic Human-Scene Interaction Modeling
18 2023-11-11 link PerceptionGPT: Effectively Fusing Visual Perception Into LLM
18 2024-01-11 link Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
18 2023-12-15 link GSVA: Generalized Segmentation via Multimodal Large Language Models
18 2024-01-17 link Vlogger: Make Your Dream A Vlog
18 2024-04-15 link PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
18 2024-04-08 link LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
18 2023-12-19 link InstructVideo: Instructing Video Diffusion Models with Human Feedback
18 2023-12-11 link EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
18 2023-12-11 link CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models
18 2023-12-04 link COTR: Compact Occupancy TRansformer for Vision-Based 3D Occupancy Prediction
18 2023-05-25 link Learning Occupancy for Monocular 3D Object Detection
18 2023-11-20 link OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning
17 2023-12-11 link Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior
17 2023-08-20 link Boosting Adversarial Transferability by Block Shuffle and Rotation
17 2024-03-24 link EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
17 2023-06-15 link Generative Proxemics: A Prior for 3D Social Interaction from Images
17 2024-01-17 link TextureDreamer: Image-Guided Texture Synthesis through Geometry-Aware Diffusion
17 2024-04-01 link Video Interpolation with Diffusion Models
17 2023-12-05 link Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
17 2023-05-19 link Equivariant Multi-Modality Image Fusion
17 2024-02-20 link Video ReCap: Recursive Captioning of Hour-Long Videos
17 2023-12-02 link Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
17 2023-12-12 link A Simple Recipe for Contrastively Pre-Training Video-First Encoders Beyond 16 Frames
17 2023-03-25 link UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes
16 2024-01-31 link AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
16 2023-12-21 link VCoder: Versatile Vision Encoders for Multimodal Large Language Models
16 2024-04-04 link Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
16 2023-12-04 link Aligning and Prompting Everything All at Once for Universal Visual Perception
16 2023-11-28 link Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence
16 2024-01-04 link Learning the 3D Fauna of the Web
16 2024-02-14 link Loopy-SLAM: Dense Neural SLAM with Loop Closures
16 2023-11-20 link DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation
16 2023-12-19 link Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
16 2023-12-10 link SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction
16 2023-12-04 link Towards Learning a Generalist Model for Embodied Navigation
16 2023-05-31 link Control4D: Efficient 4D Portrait Editing With Text
16 2023-11-09 link Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual
Modalities
16 2023-12-01 link Dense Optical Tracking: Connecting the Dots
16 2023-12-18 link SCEdit: Efficient and Controllable Image Diffusion Generation via Skip
Connection Editing
16 2023-12-07 link Open-Vocabulary Segmentation with Semantic-Assisted Calibration
16 2023-12-21 link ZeroShape: Regression-Based Zero-Shot Shape Reconstruction
16 2024-03-01 link Rethinking Inductive Biases for Surface Normal Estimation
16 2024-03-18 link MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception
16 2023-12-05 link Describing Differences in Image Sets with Natural Language
16 2023-11-28 link Embodied Multi-Modal Agent trained by an LLM from a
Parallel TextWorld
15 2023-12-28 link ZONE: Zero-Shot Instruction-Guided Local Editing
15 2024-03-26 link Move as you Say, Interact as you can: Language-Guided
Human Motion Generation with Scene Affordance
15 2023-12-18 link CLOVA: A Closed-LOop Visual Assistant with Tool Usage and
Update
15 2023-12-06 link AVID: Any-Length Video Inpainting with Diffusion Model
15 2024-04-15 link SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy
Prediction
15 2023-11-27 link TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
15 2023-12-05 link Visual Program Distillation: Distilling Tools and Programmatic Reasoning into
Vision-Language Models
15 2024-03-08 link Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
15 2023-12-12 link Peekaboo: Interactive Video Generation via Masked-Diffusion
15 2024-04-10 link HRVDA: High-Resolution Visual Document Assistant
15 2024-01-03 link A Vision Check-up for Language Models
15 2024-04-30 link XFeat: Accelerated Features for Lightweight Image Matching
15 2024-01-25 link pix2gestalt: Amodal Segmentation by Synthesizing Wholes
15 2023-11-30 link MotionEditor: Editing Video Motion via Content-Aware Diffusion
15 2023-12-26 link HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D
15 2023-12-05 link FINER: Flexible Spectral-Bias Tuning in Implicit NEural Representation by
Variable-periodic Activation Functions
15 2024-05-07 link DriveWorld: 4D Pre-Trained Scene Understanding via World Models for
Autonomous Driving
15 2023-01-26 link Discovering and Mitigating Visual Biases Through Keyword Explanation
15 2024-04-07 link Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models
15 2023-11-27 link SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
15 2023-11-26 link Insect-Foundation: A Foundation Model and Large-Scale 1M Dataset for
Visual Insect Understanding
15 2023-11-27 link Single-Model and Any-Modality for Video Object Tracking
15 2024-03-04 link HanDiffuser: Text-to-Image Generation with Realistic Hand Appearances
15 2023-12-19 link Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image
Segmentation
15 2024-03-14 link OneTracker: Unifying Visual Object Tracking with Foundation Models and
Efficient Tuning
15 2023-12-27 link SVGDreamer: Text Guided SVG Generation with Diffusion Model
15 2023-05-19 link DAP: A Dynamic Adversarial Patch for Evading Person Detectors
14 2023-10-03 link Sieve: Multimodal Dataset Pruning Using Image Captioning Models
14 2024-03-09 link Robust Emotion Recognition in Context Debiasing
14 2024-01-09 link DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-Driven Holistic 3D
Expression and Gesture Generation
14 2024-02-19 link Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with
Queryable Objects and Open-Set Relationships
14 2023-12-05 link Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for
All-in-One Image Restoration
14 2023-12-20 link Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
14 2023-12-07 link Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
14 2023-07-14 link SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments
14 2023-11-27 link Text2Loc: 3D Point Cloud Localization from Natural Language
14 2024-04-06 link InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
14 2023-05-27 link Zero-TPrune: Zero-Shot Token Pruning Through Leveraging of the Attention
Graph in Pre-Trained Transformers
14 2023-12-07 link Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from
Open-Source Histopathology Videos
14 2024-03-25 link TRIP: Temporal Residual Learning with Image Noise Prior for
Image-to-Video Diffusion Models
14 2024-03-15 link IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
14 2023-08-22 link MatFuse: Controllable Material Generation with Diffusion Models
14 2024-03-11 link DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
14 2024-01-01 link Retrieval-Augmented Egocentric Video Captioning
14 2023-12-07 link NeRFiller: Completing Scenes via Generative 3D Inpainting
14 2024-06-16 link SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
14 2023-12-05 link C3: High-Performance and Low-Complexity Neural Compression from a Single
Image or Video
14 2024-03-30 link InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning
14 2024-01-29 link SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
14 2023-12-03 link FlashAvatar: High-Fidelity Head Avatar with Efficient Gaussian Embedding
14 2023-12-09 link CoGS: Controllable Gaussian Splatting
14 2024-04-01 link Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic
Propagation
14 2023-12-28 link Improving Image Restoration Through Removing Degradations in Textual Representations
14 2023-12-26 link Inter-X: Towards Versatile Human-Human Interaction Analysis
14 2023-11-29 link MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
14 2024-05-29 link NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the
Wild
13 2023-10-10 link MuseChat: A Conversational Music Recommendation System for Videos
13 2024-02-23 link Seamless Human Motion Composition with Blended Positional Encodings
13 2023-12-25 link A Recipe for Scaling up Text-to-Video Generation with Text-free
Videos
13 2024-02-27 link VRP-SAM: SAM with Visual Reference Prompt
13 2024-03-15 link RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception
13 2023-11-18 link Structure-Aware Sparse-View X-Ray 3D Reconstruction
13 2023-03-24 link DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis
13 2023-11-30 link ElasticDiffusion: Training-Free Arbitrary Size Image Generation Through Global-Local Content
Separation
13 2024-03-05 link Sniffer: Multimodal Large Language Model for Explainable Out-of-Context Misinformation
Detection
13 2023-12-31 link Taming Mode Collapse in Score Distillation for Text-to-3D Generation
13 2024-03-29 link FairCLIP: Harnessing Fairness in Vision-Language Learning
13 2024-05-07 link Tactile-Augmented Radiance Fields
13 2023-11-30 link HOLD: Category-Agnostic 3D Reconstruction of Interacting Hands and Objects
from Video
13 2024-03-20 link DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data
Generation and Perception
13 2023-11-26 link Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding
13 2023-11-21 link Breathing Life Into Sketches Using Text-to-Video Priors
13 2024-03-27 link ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth
Estimation
13 2023-12-07 link Text-to-3D Generation with Bidirectional Diffusion Using Both 2D and 3D Priors
13 2023-11-30 link ChatPose: Chatting about 3D Human Pose
13 2024-03-19 link FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
13 2024-03-21 link CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-Spoofing
13 2023-11-22 link PIE-NeRF: Physics-Based Interactive Elastodynamics with NeRF
13 2024-01-02 link Towards a Simultaneous and Granular Identity-Expression Control in Personalized
Face Generation
13 2023-12-07 link Digital Life Project: Autonomous 3D Characters with Social Intelligence
13 2023-12-10 link AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into
One
13 2023-06-15 link Seeing the World through Your Eyes
13 2024-05-21 link Nearest is Not Dearest: Towards Practical Defense Against Quantization-Conditioned
Backdoor Attacks
13 2024-04-06 link Diffusion Time-step Curriculum for One Image to 3D Generation
13 2023-12-07 link MuRF: Multi-Baseline Radiance Fields
13 2023-12-11 link Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution
13 2023-03-23 link NOPE: Novel Object Pose Estimation from a Single Image
13 2023-12-05 link HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces
12 2023-08-19 link Noisy-Correspondence Learning for Text-to-Image Person Re-Identification
12 2023-11-28 link AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and
Beyond
12 2024-03-01 link Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching
12 2024-04-08 link PromptAD: Learning Prompts with only Normal Samples for Few-Shot
Anomaly Detection
12 2024-05-02 link Multi-Space Alignments Towards Universal LiDAR Segmentation
12 2023-01-22 link Summarize the Past to Predict the Future: Natural Language
Descriptions of Context Boost Multimodal Object Interaction Anticipation
12 2023-12-07 link ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
12 2024-03-05 link Interactive Continual Learning: Fast and Slow Thinking
12 2023-12-28 link Amodal Ground Truth and Completion in the Wild
12 2023-11-27 link EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for
Open-World Comprehension
12 2024-03-25 link VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation
12 2022-12-06 link Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis
12 2023-11-29 link One-Shot Open Affordance Learning with Foundation Models
12 2023-12-04 link Readout Guidance: Learning Control from Diffusion Features
12 2023-12-04 link PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric
Depth Estimation
12 2024-03-18 link HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
12 2024-04-01 link Towards Memorization-Free Diffusion Models
12 2023-04-05 link VicTR: Video-conditioned Text Representations for Activity Recognition
12 2024-03-09 link RealNet: A Feature Selection Network with Realistic Synthetic Anomaly
for Anomaly Detection
12 2023-11-30 link DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars
12 2023-11-13 link Open-Vocabulary Video Anomaly Detection
12 2023-12-11 link DiffCast: A Unified Framework via Residual Diffusion for Precipitation
Nowcasting
12 2024-04-03 link On the Scalability of Diffusion-based Text-to-Image Generation
12 2024-02-08 link Question Aware Vision Transformer for Multimodal Reasoning
12 2024-03-12 link Beyond Text: Frozen Large Language Models in Visual Signal
Comprehension
12 2023-08-25 link Residual Denoising Diffusion Models
12 2024-03-05 link PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
12 2023-11-30 link On Exact Inversion of DPM-Solvers
12 2024-03-07 link Depth-Aware Test-Time Training for Zero-Shot Video Object Segmentation
12 2024-03-17 link Selective Hourglass Mapping for Universal Image Restoration Based on
Diffusion Model
12 2024-03-07 link Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed
12 2024-01-09 link Pre-Trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
12 2023-11-28 link Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
12 2024-01-08 link MS-DETR: Efficient DETR Training with Mixed Supervision
11 2023-12-07 link Prompt Highlighter: Interactive Control for Multi-Modal LLMs
11 2024-03-05 link Multi-Modal Instruction Tuned LLMs with Fine-Grained Visual Perception
11 2024-03-26 link Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language
Models
11 2023-11-28 link Text-Driven Image Editing via Learnable Regions
11 2024-02-22 link CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation
11 2024-04-29 link EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars
11 2024-01-16 link TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding
11 2023-11-25 link Point Cloud Pre-Training with Diffusion Models
11 2024-03-28 link Mitigating Motion Blur in Neural Radiance Fields with Events
and Frames
11 2023-12-19 link Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image
Diffusion Models
11 2024-03-12 link PeLK: Parameter-Efficient Large Kernel ConvNets with Peripheral Convolution
11 2023-11-29 link Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
11 2023-12-20 link SpecNeRF: Gaussian Directional Encoding for Specular Reflections
11 2024-01-15 link MaskClustering: View Consensus Based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation
11 2023-09-14 link DePT: Decoupled Prompt Tuning
11 2024-04-09 link HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention
11 2024-06-06 link Matching Anything by Segmenting Anything
11 2023-11-27 link ViT-Lens: Towards Omni-modal Representations
11 2024-03-27 link Unleashing the Potential of SAM for Medical Adaptation via
Hierarchical Decoding
11 2024-01-18 link Towards Language-Driven Video Inpainting via Multimodal Large Language Models
11 2023-11-23 link PointOBB: Learning Oriented Object Detection via Single Point Supervision
11 2023-12-15 link Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing
by Attention Modulation
11 2023-04-02 link From Isolated Islands to Pangea: Unifying Semantic Space for
Human Action Understanding
11 2024-04-05 link Koala: Key Frame-Conditioned Long Video-LLM
11 2024-01-04 link Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions
11 2024-06-16 link VideoLLM-online: Online Video Large Language Model for Streaming Video
11 2023-11-30 link Can Protective Perturbation Safeguard Personal Data from Being Exploited
by Stable Diffusion?
11 2023-11-17 link High-fidelity Person-centric Subject-to-Image Synthesis
11 2024-04-11 link GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using
Gaussians-on-Mesh
11 2024-03-04 link Neural Redshift: Random Networks are not Random Functions
11 2024-03-17 link A Dual-Augmentor Framework for Domain Generalization in 3D Human
Pose Estimation
11 2024-04-22 link AutoAD III: The Prequel - Back to the Pixels
11 2024-02-29 link SeD: Semantic-Aware Discriminator for Image Super-Resolution
11 2023-11-29 link Continual Self-Supervised Learning: Towards Universal Multi-Modal Medical Data Representation
Learning
11 2024-02-13 link PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning
of Diffusion Models
11 2024-02-12 link Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in
Connected Automated Vehicles
11 2023-10-26 link SD4Match: Learning to Prompt Stable Diffusion Model for Semantic
Matching
11 2024-01-16 link MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
11 2024-01-16 link SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
11 2023-11-18 link Implicit Event-RGBD Neural SLAM
11 2023-08-15 link Link-Context Learning for Multimodal LLMs
11 2024-02-28 link TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
11 2023-12-31 link EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive
Masked Audio Gesture Modeling
11 2023-12-13 link See, Say, and Segment: Teaching LMMs to Overcome False
Premises
11 2024-01-04 link BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
11 2023-11-23 link Posterior Distillation Sampling
11 2024-03-21 link EventDance: Unsupervised Source-Free Cross-Modal Adaptation for Event-Based Object Recognition
11 2023-12-12 link DiffMorpher: Unleashing the Capability of Diffusion Models for Image
Morphing
11 2024-02-27 link VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D
Medical Image Analysis
10 2024-03-24 link Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering
Refinement
10 2023-04-24 link End-to-End Spatio-Temporal Action Localisation with Video Transformers
10 2023-11-03 link HIPTrack: Visual Tracking with Historical Prompts
10 2023-12-05 link Alchemist: Parametric Control of Material Properties with Diffusion Models
10 2023-12-14 link Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences
10 2024-03-25 link RCBEVDet: Radar-Camera Fusion in Bird's Eye View for 3D
Object Detection
10 2023-11-28 link End-to-End Temporal Action Detection with 1B Parameters Across 1000
Frames
10 2023-12-19 link Mask Grounding for Referring Image Segmentation
10 2023-08-15 link Relightable and Animatable Neural Avatar from Sparse-View Video
10 2023-12-11 link Localization is All You Evaluate: Data Leakage in Online
Mapping Datasets and How to Fix it
10 2023-12-11 link MonoNPHM: Dynamic Head Reconstruction from Monocular Videos
10 2024-05-03 link On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do
we Really need Prompt Learning?
10 2023-12-04 link ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and
Explicit Adaptation
10 2024-04-01 link MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction
10 2024-05-15 link SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World
Knowledge
10 2023-05-24 link InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360° Neural Radiance
Fields
10 2023-12-06 link UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity
10 2024-03-22 link IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
10 2024-06-05 link AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
10 2023-12-01 link Segment and Caption Anything
10 2023-06-20 link CrossKD: Cross-Head Knowledge Distillation for Object Detection
10 2023-11-07 link 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
10 2023-12-20 link A Closer Look at the Few-Shot Adaptation of Large
Vision-Language Models
10 2023-12-28 link Unsupervised Universal Image Segmentation
10 2024-02-29 link PEM: Prototype-Based Efficient MaskFormer for Image Segmentation
10 2023-10-23 link MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion
10 2024-04-10 link UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet
Diffusion
10 2024-01-16 link Transcending the Limit of Local Window: Advanced Super-Resolution Transformer
with Adaptive Token Dictionary
10 2024-05-19 link Morphological Prototyping for Unsupervised Slide Representation Learning in Computational
Pathology
10 2023-12-20 link Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps
10 2023-10-27 link ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
10 2023-12-03 link Language-driven All-in-one Adverse Weather Removal
10 2023-11-23 link Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-End
Oriented Object Detection with Single Point Supervision
10 2023-12-04 link PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness
10 2024-02-29 link CricaVPR: Cross-Image Correlation-Aware Representation Learning for Visual Place Recognition
10 2023-11-28 link As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors