Characterized Diffusion and Spatial-Temporal Interaction Network for Trajectory Prediction in Autonomous Driving (2024)

Haicheng Liao1  Xuelin Li2  Yongkang Li2  Hanlin Kong2  Chengyue Wang1
Bonan Wang1  Yanchen Guan1  KaHou Tam1  Zhenning Li1∗†  Chengzhong Xu1
Authors contributed equally; †Corresponding author.
1University of Macau
2University of Electronic Science and Technology of China
{yc27979, chengyuewang, mc3500, yc37976, yc374361, zhenningli, czxu}@um.edu.com, lxl.cooper@outlook.com, franklin1234560@163.com, hanlinkong@foxmail.com

Abstract

Trajectory prediction is a cornerstone in autonomous driving (AD), playing a critical role in enabling vehicles to navigate safely and efficiently in dynamic environments. To address this task, this paper presents a novel trajectory prediction model tailored for accuracy in the face of heterogeneous and uncertain traffic scenarios. At the heart of this model lies the Characterized Diffusion Module, an innovative module designed to simulate traffic scenarios with inherent uncertainty. This module enriches the predictive process by infusing it with detailed semantic information, thereby enhancing trajectory prediction accuracy. Complementing this, our Spatio-Temporal (ST) Interaction Module captures the nuanced effects of traffic scenarios on vehicle dynamics across both spatial and temporal dimensions with remarkable effectiveness. Demonstrated through exhaustive evaluations, our model sets a new standard in trajectory prediction, achieving state-of-the-art (SOTA) results on the Next Generation Simulation (NGSIM), Highway Drone (HighD), and Macao Connected Autonomous Driving (MoCAD) datasets across both short and extended temporal spans. This performance underscores the model’s unparalleled adaptability and efficacy in navigating complex traffic scenarios, including highways, urban streets, and intersections.

[Figure 1]

1 Introduction

In the domain of autonomous driving (AD), trajectory prediction plays a pivotal role by providing invaluable insights for the subsequent trajectory planning module, thereby enhancing the safety of navigation in complex and dynamic traffic scenarios Huang et al. (2022). The continuous presence of mixed traffic flow necessitates a trajectory prediction model that deeply understands the heterogeneity and uncertainty of traffic scenarios Liao et al. (2024e). Despite the proliferation of trajectory prediction models, significant gaps remain in the thorough investigation of the impact of heterogeneous and uncertain traffic scenarios on future motion.

The initial gap we pinpoint hinges on the accurate simulation of future traffic scenarios—a cornerstone for enhancing trajectory prediction precision Wang et al. (2023); Liao et al. (2024d). The challenge is amplified by the intrinsic uncertainties characterizing traffic dynamics, making the accurate forecast of future scenarios a complex endeavor. Prevailing models have primarily concentrated on uncertainties inherent to the target agent Zhao et al. (2019); Alahi et al. (2016); Gupta et al. (2018), thereby neglecting the comprehensive uncertainty pervasive in the overall traffic scenarios. This oversight highlights an imperative need for trajectory prediction frameworks to adeptly navigate and mitigate uncertainties, facilitating a more accurate and holistic simulation of future traffic scenarios.

The second identified gap relates to the sophisticated mechanisms by which traffic scenarios influence human driving behaviors. The decision-making processes of human drivers are profoundly shaped by their interactions with other traffic agents, and such interactions rest on a nuanced interplay between spatial and temporal dimensions. Nevertheless, prevailing models have primarily focused on capturing spatial interactions, largely overlooking the critical temporal dynamics Wang et al. (2023). This neglect highlights the critical necessity for models that adeptly incorporate both spatial and temporal interactions.

To address these challenges, we introduce a novel generative model, CDSTraj, which is built on a dual architecture, as shown in Fig. 1. Our model employs an encoder designed to generate spatial-temporal features from past states while fusing confidence features to ensure the stability and reliability of the representation. This enriched feature set serves as the foundational input for a decoder that generates future trajectory predictions. A key innovation in our approach is the introduction of a characterized diffusion mechanism, which is seamlessly integrated with a spatial-temporal interaction network. This synergistic combination allows our model to be aware of the indeterminacy associated with both scene-to-agent and agent-to-agent contexts. Consequently, this leads to trajectory predictions that are both more accurate and reliable, even in dynamically changing environments. Overall, our main contributions can be summarized as follows:

  • We introduce the Characterized Diffusion Module, a novel approach that enhances trajectory prediction by dynamically simulating future traffic scenarios through iterative uncertainty mitigation. This module significantly augments the predictive accuracy by integrating complex, contextual scenario features, allowing for a more nuanced understanding of potential motion.

  • We unveil the Spatial-Temporal Interaction Module, which leverages a spatio-temporal attention mechanism to meticulously model and analyze the intricate interactions characteristic of traffic scenarios. Unique to this module is its three-stage architecture, designed to efficiently capture and process information across spatial and temporal dimensions.

  • Our rigorous empirical investigations underscore the superiority of our model over existing trajectory prediction models. Through extensive experiments, our model achieves top performance on public datasets such as NGSIM, HighD, and MoCAD. The exceptional performance on the MoCAD dataset is especially significant, offering a fresh perspective for evaluation through its unique right-hand-drive configuration and obligatory left-hand traffic regime, thus underscoring our model's adaptability and accuracy in varying driving scenarios.

2 Related Works

Trajectory Prediction for Autonomous Driving. Early trajectory prediction methods primarily relied on manual feature engineering and rule-based techniques, including linear regression and Kalman filters Prevost et al. (2007). These methods were limited in capturing complex interactions in dynamic environments. The field evolved significantly with the introduction of deep learning, specifically Recurrent Neural Networks (RNNs) Kim et al. (2017) and Long Short-Term Memory (LSTM) networks Altché and de La Fortelle (2017); Alahi et al. (2016); Liao et al. (2024b). These advancements enabled the capture of temporal dependencies in trajectories. Further innovation came with Graph Neural Networks (GNNs) Zhou et al. (2021); Liao et al. (2024a, c), which provided a more nuanced approach to modeling interactions among agents in crowded scenes.

Generative Models for Trajectory Prediction. Generative models like Generative Adversarial Networks (GANs) Gupta et al. (2018) and Variational Auto-Encoders (VAEs) Lee et al. (2017) have gained prominence in trajectory prediction. GANs involve a generator and a discriminator engaged in mutual learning, while VAEs use a generative model and a variational posterior, the optimization of which can be complex. Diffusion models, on the other hand, offer a simplified training process by focusing on matching the forward and inverse diffusion processes. To the best of our knowledge, this work is the first to leverage diffusion models for capturing confidence features.

[Figure 2]

Denoising Diffusion Probabilistic Models. Denoising Diffusion Probabilistic Models (DDPM) Ho et al. (2020), known as diffusion models, have gained prominence as powerful generative models for various applications, including image Ramesh et al. (2022); Rombach et al. (2022); Liao et al. (2024f), video Ho et al. (2022), and 3D shape Poole et al. (2022) generation. Inspired by the diffusion models' enormous representation capacities in numerous generation tasks, our work introduces the novel application of diffusion models to trajectory prediction in autonomous driving, addressing the challenges of modeling uncertainties and complex agent interactions in dynamic environments.

3 Problem Formulation

The paramount objective of this study is the precise prediction of trajectories for all entities within the proximity of an autonomous vehicle (AV) situated in an environment characterized by mixed autonomy. For this purpose, every entity proximal to the AV is designated as a target agent. At a specific time $t_c$, our model endeavors to utilize the historical states of both the target agent and its neighboring agents to predict the future trajectory of the target agent, represented as $\bm{Y}_0$, extending to a future time $t_c + t_f$. The historical states since time $t_c - t_h$ are denoted by $\bm{X}_0$ for the target agent and $\bm{X}_i$ for the neighboring agents.

The novelty of our model lies in its exploitation of anticipated future traffic scenarios, specifically the future trajectories of neighboring agents, to enhance the accuracy of trajectory prediction for the target agent. To this end, we develop a Characterized Diffusion Module, designed to systematically mitigate the uncertainty inherent in the trajectories of neighboring agents, thereby enabling accurate prediction of their future trajectories $\bm{Y}_i$. Formally, our prediction model $\Phi$ is represented as:

$\bm{Y}_0 = \Phi(\bm{X}_0, \bm{X}_i, \bm{Y}_i), \ \forall i \in [1, n]$ (1)
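For concreteness, the sketch below shows how the prediction interface of Eq. (1) might look in code; the tensor shapes and the wrapper name CDSTrajInterface are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CDSTrajInterface(nn.Module):
    """Illustrative interface for the prediction model Phi in Eq. (1)."""

    def __init__(self, predictor: nn.Module):
        super().__init__()
        self.predictor = predictor  # the full CDSTraj network (assumed)

    def forward(self,
                X0: torch.Tensor,   # target history:      (B, t_h, 2)
                Xi: torch.Tensor,   # neighbor histories:  (B, n, t_h, 2)
                Yi: torch.Tensor    # anticipated neighbor futures from the
                                    # Characterized Diffusion Module: (B, n, t_f, 2)
                ) -> torch.Tensor:
        # Y0: predicted target future, shape (B, t_f, 2)
        return self.predictor(X0, Xi, Yi)
```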

4 Methodology

4.1 Overview of the Model Framework

As illustrated in Figure 2, our model comprises three primary components: the Characterized Diffusion Module, the ST Interaction Module, and a Multi-modal Decoder. The Characterized Diffusion Module employs an inverse diffusion process to generate the future trajectories of neighboring agents. Concurrently, the ST Interaction Module extracts spatial-temporal interaction features through a methodical alternation between spatial and temporal dimensions. Ultimately, the predicted trajectories are generated by a multi-modal decoder, which synthesizes the processed information to produce accurate trajectory predictions.

4.2 Characterized Diffusion

To predict the future trajectories of neighboring agents, the Characterized Diffusion Module treats trajectory prediction as the reverse of a characterized diffusion process over motion and gradually eliminates the uncertainty of future trajectories by learning a parameterized Markov chain conditioned on the observed historical states. More specifically, during the diffusion process, the uncertainty inherent in future trajectories is simulated by iteratively introducing Gaussian noise. Conversely, in the inverse diffusion process, this uncertainty is iteratively mitigated to accurately derive the anticipated future trajectories. The detailed procedure is shown in Fig. 3. Mathematically, let $\mathbf{C}$ be the future trajectory of the neighboring agents. First, we initialize the diffused unit $\mathbf{C}^0$:

$\mathbf{C}^0 = \mathbf{C}$ (2)
[Figure 3]

We use a forward diffusion operation $f_{\textit{diffuse}}(\cdot)$ to add uncertainty to $\mathbf{C}^{\delta-1}$ and transition to the diffused unit $\mathbf{C}^{\delta}$:

$\mathbf{C}^{\delta} = f_{\textit{diffuse}}(\mathbf{C}^{\delta-1}), \quad \delta = 1, \dots, \Gamma$ (3)

where $\mathbf{C}^{\delta}$ is the diffused unit at the $\delta^{th}$ diffusion step and $\Gamma$ is the total number of steps.
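As a minimal sketch, the forward operation $f_{\textit{diffuse}}(\cdot)$ can be written as the standard DDPM noising transition; the linear noise schedule in the comment is an assumption, since the paper does not specify one.

```python
import torch

def f_diffuse(C_prev: torch.Tensor, beta_delta: float) -> torch.Tensor:
    """One forward diffusion step (Eq. 3): inject Gaussian noise into the
    previous diffused unit, following the standard DDPM transition
    q(C^d | C^{d-1}) = N(sqrt(1 - beta_d) * C^{d-1}, beta_d * I)."""
    noise = torch.randn_like(C_prev)
    return (1.0 - beta_delta) ** 0.5 * C_prev + beta_delta ** 0.5 * noise

# Running the chain for Gamma steps starting from C^0 = C (Eq. 2):
# C = future trajectories of neighboring agents, e.g. shape (n, t_f, 2)
# betas = torch.linspace(1e-4, 0.05, Gamma)  # assumed noise schedule
# for delta in range(Gamma):
#     C = f_diffuse(C, betas[delta].item())
```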

After $\Gamma$ iterations, our model is able to capture a comprehensive spectrum of uncertain traffic scenarios with maximum coverage. Then, the inverse diffusion process is applied to accurately derive the future trajectories of neighboring vehicles. This step-by-step refinement ensures high fidelity in predicting vehicle movements, effectively addressing the inherent complexity of dynamic traffic scenarios. Due to the indeterminacy of future trajectories, it is usually more reliable in the denoising procedure to draw more than one inverse unit so as to extract sufficient trajectory information. Therefore, we draw $K$ independent and identically distributed samples from a standard normal distribution to initialize the denoising units $\widehat{\mathbf{C}}_k^{\Gamma}$:

$\widehat{\mathbf{C}}_k^{\Gamma} \overset{i.i.d.}{\sim} \mathcal{P}\big(\widehat{\mathbf{C}}^{\Gamma}\big) = \mathcal{N}\big(\widehat{\mathbf{C}}^{\Gamma}; \mathbf{0}, \mathbf{I}\big), \text{ sampled } K \text{ times}$ (4)

We formulate the trajectory generation process as a reverse diffusion, iteratively applying a denoising operation $f_{\textit{denoise}}(\cdot)$ to obtain the denoised unit $\widehat{\mathbf{C}}_k^{\delta}$ conditioned on the historical states $\mathbf{X}_0$, $\mathbf{X}_i$ and the unit $\widehat{\mathbf{C}}_k^{\delta+1}$:

$\widehat{\mathbf{C}}_k^{\delta} = f_{\textit{denoise}}\big(\widehat{\mathbf{C}}_k^{\delta+1}, \mathbf{X}_0, \mathbf{X}_i\big), \quad \delta = \Gamma-1, \dots, 0$ (5)

In the denoising module, two parts are trainable: a transformer-based context encoder $f_{\textit{context}}(\cdot)$ that learns a social-temporal embedding, and an uncertainty estimation module $f_{\epsilon}(\cdot)$ that estimates the uncertainty to be removed. Mathematically, the $\delta^{th}$ denoising step works as follows:

$\mathbf{C}_{\textit{encoder}} = f_{\textit{context}}\big(\mathbf{X}_0, \mathbf{X}_i\big)$ (6)
$\bm{\epsilon}_{\theta}^{\delta} = f_{\epsilon}\big(\widehat{\mathbf{C}}_k^{\delta+1}, \mathbf{C}_{\textit{encoder}}, \delta+1\big)$ (7)
$\widehat{\mathbf{C}}_k^{\delta} = \frac{1}{\sqrt{\alpha_{\delta}}}\Big(\widehat{\mathbf{C}}_k^{\delta+1} - \frac{1-\alpha_{\delta}}{\sqrt{1-\bar{\alpha}_{\delta}}}\,\bm{\epsilon}_{\theta}^{\delta}\Big) + \sqrt{1-\alpha_{\delta}}\,\mathbf{z}$ (8)

where $\alpha_{\delta}$ and $\bar{\alpha}_{\delta} = \prod_{i=1}^{\delta}\alpha_i$ are parameters of the diffusion process and $\mathbf{z} \sim \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$ is the injected uncertainty. The context encoder $f_{\textit{context}}(\cdot)$ operates on the historical states $(\mathbf{X}_0, \mathbf{X}_i)$ to obtain the context condition $\mathbf{C}_{\textit{encoder}}$, and the uncertainty $\bm{\epsilon}_{\theta}^{\delta}$ in the uncertain unit $\widehat{\mathbf{C}}_k^{\delta+1}$ is estimated by $f_{\epsilon}(\cdot)$, implemented by multi-layer perceptrons conditioned on the context $\mathbf{C}_{\textit{encoder}}$; Eqs. (6)-(8) together constitute a standard denoising step.
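The following sketch mirrors the denoising update of Eq. (8); the predicted uncertainty $\bm{\epsilon}_{\theta}^{\delta}$ is taken as an input here (in the model it comes from $f_{\epsilon}$ conditioned on $\mathbf{C}_{\textit{encoder}}$), and disabling the added noise on the final step is a common DDPM convention assumed for illustration.

```python
import torch

def f_denoise_step(C_next: torch.Tensor,     # \hat{C}_k^{delta+1}
                   eps_theta: torch.Tensor,  # predicted uncertainty, Eq. (7)
                   alpha: torch.Tensor,      # alpha_delta
                   alpha_bar: torch.Tensor,  # cumulative product up to delta
                   add_noise: bool = True) -> torch.Tensor:
    """One reverse step following Eq. (8)."""
    z = torch.randn_like(C_next) if add_noise else torch.zeros_like(C_next)
    mean = (C_next - (1.0 - alpha) / torch.sqrt(1.0 - alpha_bar) * eps_theta) / torch.sqrt(alpha)
    return mean + torch.sqrt(1.0 - alpha) * z

# The loop runs delta = Gamma-1, ..., 0 independently for each of the K samples.
```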

The final $K$ extracted units are $\widehat{\mathbf{C}} = \{\widehat{\mathbf{C}}_1^0, \widehat{\mathbf{C}}_2^0, \dots, \widehat{\mathbf{C}}_K^0\}$. Finally, we obtain the future trajectories with:

$\mathbf{Y}_i = \Omega\big(\widehat{\mathbf{C}}, W_{\textit{cf}}\big)$ (9)

where $\Omega$ denotes a multi-layer perceptron with learnable parameter matrix $W_{\textit{cf}}$.

4.3 Spatial and Temporal Interaction

To enhance the precision of modeling the temporal and spatial dynamics of vehicle interactions within the environment, the Spatial-Temporal (ST) Interaction Module is designed with a novel structure that alternates between temporal and spatial dimensions. This module is composed of three key components, as illustrated in Fig. 2: 1) Temporal Encoder, which extracts the temporal dependencies of all agents from their historical states; 2) Spatial Encoder, which plays an essential role in extracting the spatial relations between the target agent and neighboring agents; and 3) ST Fusion, which aims to deeply capture the spatial-temporal interaction.

1) Temporal Encoder: To begin with, a temporal embedding vector $F^t$ is obtained from the historical states $x^t$ at the $t^{th}$ timestamp using a fully connected layer with a learnable parameter matrix $W_{\textit{emb}}$ as follows:

$F^t = \delta\big(\phi(x^t, W_{\textit{emb}})\big)$ (10)

where $\phi(\cdot, W_{\textit{emb}})$ is the fully connected layer and $\delta(\cdot)$ is the LeakyReLU activation function. The temporal feature is then updated recurrently as follows:

$h^t = f_{\textit{tem}}\big(F^t, h^{t-1}, W_{\textit{init}}\big)$ (11)

where $W_{\textit{init}}$ denotes the learnable parameter matrix of the encoder $f_{\textit{tem}}$ and $h^t$ denotes the temporal feature at the $t^{th}$ timestamp, which is updated at each timestamp from the hidden state at the previous timestamp and the embedding vector at the current timestamp. We apply $f_{\textit{tem}}$ to every agent with shared parameters to reduce variation across numerous agents at different timestamps. Finally, for the target agent we obtain $H_0 = [h_0^{-T_p+1}, h_0^{-T_p+2}, \dots, h_0^{0}] \in \mathbb{R}^{T_p \times D}$, representing the temporal feature over $T_p$ timestamps, and $\bar{H}_i = [\bar{h}_i^{-T_p+1}, \bar{h}_i^{-T_p+2}, \dots, \bar{h}_i^{0}] \in \mathbb{R}^{T_p \times D}$, denoting the temporal feature of the $i^{th}$ neighboring agent, where $D$ denotes the number of hidden dimensions.
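A minimal sketch of the temporal encoder defined by Eqs. (10)-(11) is given below; the choice of a GRU cell for $f_{\textit{tem}}$ and the hidden size are assumptions, since the paper only specifies a shared recurrent encoder.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch of Eqs. (10)-(11): per-timestamp embedding followed by a
    recurrent update shared across all agents."""

    def __init__(self, in_dim: int = 2, hidden_dim: int = 64):
        super().__init__()
        self.emb = nn.Linear(in_dim, hidden_dim)        # phi(., W_emb)
        self.act = nn.LeakyReLU()                       # activation in Eq. (10)
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)  # f_tem (assumed GRU)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_agents, T_p, 2) historical states of all agents
        n, T_p, _ = x.shape
        h = x.new_zeros(n, self.cell.hidden_size)
        feats = []
        for t in range(T_p):
            F_t = self.act(self.emb(x[:, t]))  # Eq. (10)
            h = self.cell(F_t, h)              # Eq. (11)
            feats.append(h)
        return torch.stack(feats, dim=1)       # (num_agents, T_p, D)
```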

2) Spatial Encoder: A single temporal feature per agent does not capture agent-to-agent spatial relations, so capturing the spatial relations between agents in the same scene is necessary. Given the success of attention mechanisms in sequence-based prediction Hu (2020), we adopt a multi-head attention mechanism to obtain the spatial relations between agents, which can be represented as follows:

$Q, K, V = f_{sp}\big(H, \bar{H}, W_q, W_k, W_v\big)$ (12)

where $f_{sp}$ denotes the spatial attention operation.

In detail, $Q = [q^{-T_p+1}, q^{-T_p+2}, \dots, q^{0}]$, $K = [k^{-T_p+1}, k^{-T_p+2}, \dots, k^{0}]$, and $V = [v^{-T_p+1}, v^{-T_p+2}, \dots, v^{0}]$ respectively denote the linearly projected query, key, and value vectors, and $W_q$, $W_k$, $W_v$ indicate three learnable parameter matrices. We apply a normalization $\Pi$ to the query and key to represent the importance of agents influencing each other. Formally,

$\omega = \Pi\Big(\frac{Q \cdot K}{\sqrt{D_{\textit{init}}}}\Big)$ (13)

where $\omega$ is the attention score representing the similarity between $Q$ and $K$, and $(\cdot)$ denotes matrix multiplication. Subsequently, we leverage the attention scores to discern the significant connections among agents. Mathematically,

$\upsilon = \omega \cdot V$ (14)

where $\upsilon$ is the output of single-head attention. Compared with a single head, multi-head attention can more comprehensively capture local and global spatial relations. Therefore, we apply the multi-head attention mechanism to obtain $\Upsilon = [\upsilon_1, \upsilon_2, \dots, \upsilon_n]$, which carries the spatial relations. We further introduce an innovative gating mechanism $H_g$ to control the importance of the different heads and selectively amplify or suppress specific heads. Acting as a gatekeeper for $\Upsilon$, it adjusts the activation level through two linear layers as:

$H_a = \kappa(\Upsilon), \quad H_g = \sigma\big(\kappa(\Upsilon)\big), \quad S = H_a \odot H_g$ (15)

where $\sigma$ denotes the sigmoid activation function, $\odot$ denotes element-wise multiplication, and $\kappa$ denotes a linear layer. The output of the spatial encoder is simplified as follows:

$S = [s^{-T_p+1}, s^{-T_p+2}, \dots, s^{0}] \in \mathbb{R}^{T_p \times D}$ (16)
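The sketch below illustrates the spatial encoder of Eqs. (12)-(15) with an off-the-shelf multi-head attention layer; applying the gate to the concatenated multi-head output (rather than head by head) and the head count are simplifying assumptions.

```python
import torch
import torch.nn as nn

class GatedSpatialAttention(nn.Module):
    """Sketch of the Spatial Encoder (Eqs. 12-15): multi-head attention over
    agents followed by the gating mechanism S = H_a * sigmoid(kappa(Y))."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_ha = nn.Linear(dim, dim)  # kappa(.) producing H_a
        self.to_hg = nn.Linear(dim, dim)  # kappa(.) producing the gate H_g

    def forward(self, H0: torch.Tensor, H_nbrs: torch.Tensor) -> torch.Tensor:
        # H0: target temporal features (B, T_p, D); H_nbrs: neighbors (B, N*T_p, D)
        # Queries come from the target, keys/values from target + neighbors.
        kv = torch.cat([H0, H_nbrs], dim=1)
        out, _ = self.attn(H0, kv, kv)   # Eqs. (12)-(14); softmax normalization inside
        H_a = self.to_ha(out)
        H_g = torch.sigmoid(self.to_hg(out))
        return H_a * H_g                 # Eq. (15), S of shape (B, T_p, D)
```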

3) ST Fusion: Following the spatial encoder, we introduce an ST Fusion module $f_{\textit{ST}}$ to deeply capture the spatial-temporal interaction from the features $S$ produced by the preceding module. Formally,

$\bar{Q}, \bar{K}, \bar{V} = f_{\textit{ST}}\big(S, \bar{W}_q, \bar{W}_k, \bar{W}_v\big)$ (17)

where $\bar{Q} = [\bar{q}_1^t, \bar{q}_2^t, \dots, \bar{q}_M^t]$, $\bar{K} = [\bar{k}_1^t, \bar{k}_2^t, \dots, \bar{k}_M^t]$, and $\bar{V} = [\bar{v}_1^t, \bar{v}_2^t, \dots, \bar{v}_M^t]$ respectively denote the linearly projected query, key, and value vectors at the $t^{th}$ timestamp, and $\bar{W}_q$, $\bar{W}_k$, $\bar{W}_v$ indicate three learnable parameter matrices. In analogy with the spatial encoder, we also use normalization and gating mechanisms to obtain long-term temporal features, simplified as follows:

$U = [u^{-T_p+1}, u^{-T_p+2}, \dots, u^{0}] \in \mathbb{R}^{T_p \times D}$ (18)

4.4 Decoder

This study defines the trajectory prediction task as a conditional probabilistic prediction problem. Specifically, the decoder is designed to predict the future trajectory for the target agent based on different lateral and longitudinal maneuver classes:

$P(\hat{Y}) = P(Y \mid P_{\textit{lat}}, P_{\textit{lon}}) \cdot P_{\textit{lat}} \cdot P_{\textit{lon}}$ (19)

where $P(\hat{Y})$ is the conditional probability distribution of the predicted trajectory, and $P_{\textit{lat}}$ and $P_{\textit{lon}}$ are the probabilities of the lateral and longitudinal maneuver classes. In detail, we use an LSTM decoder to implement the final multi-modal trajectory prediction:

$\hat{y}^t = f_{\textit{LSTM}}\big(F, \hat{y}^{t-1}, W_{\textit{decoder}}\big)$ (20)

where $\hat{y}^t$ is the predicted 2D spatial coordinate at the future $t^{th}$ timestamp and $W_{\textit{decoder}}$ denotes the parameter matrix to be learned in the LSTM. Because the fused feature vectors are already comprehensive, even this simple LSTM decoder predicts accurate trajectories with relatively few parameters.
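A sketch of such a maneuver-conditioned LSTM decoder is given below; the numbers of lateral and longitudinal maneuver classes and the way the maneuver embedding initializes the hidden state are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ManeuverLSTMDecoder(nn.Module):
    """Sketch of Eqs. (19)-(20): an LSTM that unrolls t_f future positions
    conditioned on the fused feature F and a (lateral, longitudinal) maneuver class."""

    def __init__(self, feat_dim: int = 64, n_lat: int = 3, n_lon: int = 2, t_f: int = 25):
        super().__init__()
        self.t_f = t_f
        self.man_emb = nn.Linear(n_lat + n_lon, feat_dim)  # maneuver conditioning
        self.cell = nn.LSTMCell(feat_dim + 2, feat_dim)
        self.out = nn.Linear(feat_dim, 2)  # predicted (x, y) per step

    def forward(self, F: torch.Tensor, maneuver: torch.Tensor, y_last: torch.Tensor):
        # F: fused features (B, D); maneuver: one-hot (B, n_lat + n_lon); y_last: (B, 2)
        h = F + self.man_emb(maneuver)     # condition the initial state on the maneuver
        c = torch.zeros_like(h)
        preds, y_prev = [], y_last
        for _ in range(self.t_f):
            h, c = self.cell(torch.cat([F, y_prev], dim=-1), (h, c))
            y_prev = self.out(h)           # Eq. (20): next coordinate from the hidden state
            preds.append(y_prev)
        return torch.stack(preds, dim=1)   # (B, t_f, 2)
```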

5 Experiment

To evaluate the performance of our model, we perform extensive experiments on real-world datasets. This study uses a consistent segmentation framework for all three datasets: each sample is divided into 8-second segments, with the first 16 timestamps (3 seconds) serving as historical data and the following 25 timestamps (5 seconds) used for evaluation.
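For illustration, this segmentation can be implemented as a simple slice of each 8-second track; the helper name split_segment is hypothetical.

```python
import numpy as np

def split_segment(track: np.ndarray, t_h: int = 16, t_f: int = 25):
    """Sketch of the segmentation used for all three datasets: each 8-second
    sample is split into t_h history frames and t_f future frames."""
    assert track.shape[0] >= t_h + t_f
    history = track[:t_h]           # first 16 timestamps (3 s)
    future = track[t_h:t_h + t_f]   # following 25 timestamps (5 s)
    return history, future
```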

5.1 Datasets

Next Generation Simulation (NGSIM): This dataset Deo and Trivedi (2018b) consists of vehicle trajectories from the US-101 and I-80 freeways, containing approximately 45 minutes of vehicle trajectory data recorded at 10 Hz. It is critical for the analysis of vehicle behavior in a variety of traffic scenarios and assists in the development of reliable AD models.

Highway Drone (HighD): HighD Krajewski et al. (2018) is a dataset of vehicle trajectories collected from six locations on German highways. It covers 110,000 vehicles, including cars and trucks, with a total driven distance of 45,000 km. The dataset provides detailed information about each vehicle, including type, size, and maneuvers, making it invaluable for advanced vehicle trajectory analysis and AD research.

Macau Connected Autonomous Driving (MoCAD): It Liao et al. (2024b) was collected from the first Level 5 autonomous bus in Macau, which has undergone extensive testing and data collection since its deployment in 2020. The data collection period spans over 300 hours and covers various scenarios, including a 5-kilometer campus road dataset, a 25-kilometer dataset covering city and urban roads, and complex open traffic environments captured under different weather conditions, time periods, and traffic densities.

Table 1: RMSE (in meters) on the NGSIM dataset over prediction horizons of 1-5 s ("-" denotes a value not reported).

Model | 1 s | 2 s | 3 s | 4 s | 5 s
S-LSTM Alahi et al. (2016) | 0.65 | 1.31 | 2.16 | 3.25 | 4.55
S-GAN Gupta et al. (2018) | 0.57 | 1.32 | 2.22 | 3.26 | 4.40
CS-LSTM Deo and Trivedi (2018a) | 0.61 | 1.27 | 2.09 | 3.10 | 4.37
MATF-GAN Zhao et al. (2019) | 0.66 | 1.34 | 2.08 | 2.97 | 4.13
DRBP Gao et al. (2023) | 1.18 | 2.83 | 4.22 | 5.82 | -
M-LSTM Deo and Trivedi (2018b) | 0.58 | 1.26 | 2.12 | 3.24 | 4.66
IMM-KF Lefkopoulos et al. (2020) | 0.58 | 1.36 | 2.28 | 3.37 | 4.55
GAIL-GRU Kuefler et al. (2017) | 0.69 | 1.51 | 2.55 | 3.65 | 4.71
MFP Tang and Salakhutdinov (2019) | 0.54 | 1.16 | 1.89 | 2.75 | 3.78
NLS-LSTM Messaoud et al. (2019) | 0.56 | 1.22 | 2.02 | 3.03 | 4.30
MHA-LSTM Messaoud et al. (2021) | 0.41 | 1.01 | 1.74 | 2.67 | 3.83
WSiP Wang et al. (2023) | 0.56 | 1.23 | 2.05 | 3.08 | 4.34
CF-LSTM Xie et al. (2021) | 0.55 | 1.10 | 1.78 | 2.73 | 3.82
TS-GAN Wang et al. (2022) | 0.60 | 1.24 | 1.95 | 2.78 | 3.72
STDAN Chen et al. (2022) | 0.42 | 1.01 | 1.69 | 2.56 | 3.67
BAT Liao et al. (2024b) | 0.23 | 0.81 | 1.54 | 2.52 | 3.62
FHIF Zuo et al. (2023) | 0.40 | 0.98 | 1.66 | 2.52 | 3.63
DACR-AMTP Cong et al. (2023) | 0.57 | 1.07 | 1.68 | 2.53 | 3.40
Our model | 0.36 | 0.86 | 1.36 | 2.02 | 2.85

5.2 Training and Implementation Details

We adopt a two-stage training approach to train our model. In the first stage, our model is trained to predict a future trajectory with the Mean Squared Error (MSE) Pan (2020) loss function as follows:

$\mathcal{L}_{\textit{MSE}}(\hat{y}, y) = \sum_{t=1}^{t_f}\Big[\big(\hat{y}_x^t - y_x^t\big)^2 + \big(\hat{y}_y^t - y_y^t\big)^2\Big]$ (21)

where $(\hat{y}_x^t, \hat{y}_y^t)$ is the predicted 2D spatial coordinate and $(y_x^t, y_y^t)$ denotes the corresponding ground-truth coordinate. This helps our model learn the exact position information that approximates the true trajectory. Once the model has converged under the MSE loss, we switch to a Negative Log-Likelihood (NLL) Kim et al. (2020) loss function. This transition facilitates a more comprehensive exploration of uncertainty within the trajectory prediction:

$\mathcal{L}_{\textit{NLL}}(\hat{y}, y) = \sum_{t=1}^{t_f}\Big[\alpha\Big(\big(\sigma_x^t\big)^2\big(\Delta_x^t\big)^2 + \big(\sigma_y^t\big)^2\big(\Delta_y^t\big)^2 - 2\rho_{xy}^t\,\sigma_x^t\,\sigma_y^t\,\Delta_x^t\,\Delta_y^t\Big) - \log\big(P^t\big)\Big]$

where $\Delta_x^t$ and $\Delta_y^t$ represent $(y_x^t - \hat{y}_x^t)$ and $(y_y^t - \hat{y}_y^t)$, and $\sigma_x^t$, $\sigma_y^t$ denote the standard deviations of the x- and y-coordinates at the $t^{th}$ timestamp. $\rho_{xy}^t$ is the correlation coefficient between the $x$ and $y$ coordinates at the $t^{th}$ timestamp. $P^t = \sigma_x^t\,\sigma_y^t\sqrt{1-(\rho_{xy}^t)^2}$ is the standard deviation of the probability density function in the $x$ and $y$ coordinates. $\alpha$ is an empirical constant used to simplify the calculation.
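A sketch of this second-stage loss is shown below; it follows the formula literally, with $\Delta$, $\sigma$, and $\rho$ supplied as tensors produced elsewhere by the decoder, and $\alpha = 0.5$ as an assumed value for the empirical constant.

```python
import torch

def nll_loss(delta_x, delta_y, sig_x, sig_y, rho, alpha: float = 0.5):
    """Sketch of the second-stage NLL loss, written as in the formula above.
    All inputs are (B, t_f) tensors; alpha = 0.5 is an assumed constant."""
    P = sig_x * sig_y * torch.sqrt(1.0 - rho ** 2)
    nll = alpha * (sig_x ** 2 * delta_x ** 2 + sig_y ** 2 * delta_y ** 2
                   - 2.0 * rho * sig_x * sig_y * delta_x * delta_y) - torch.log(P)
    return nll.sum(dim=-1).mean()  # sum over timestamps, average over the batch
```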

Table 2: RMSE (in meters) on the HighD dataset over prediction horizons of 1-5 s ("-" denotes a value not reported).

Model | 1 s | 2 s | 3 s | 4 s | 5 s
S-LSTM Alahi et al. (2016) | 0.22 | 0.62 | 1.27 | 2.15 | 3.41
S-GAN Gupta et al. (2018) | 0.30 | 0.78 | 1.46 | 2.34 | 3.41
WSiP Wang et al. (2023) | 0.20 | 0.60 | 1.21 | 2.07 | 3.14
CS-LSTM Deo and Trivedi (2018a) | 0.22 | 0.61 | 1.24 | 2.10 | 3.27
MHA-LSTM Messaoud et al. (2021) | 0.19 | 0.55 | 1.10 | 1.84 | 2.78
NLS-LSTM Messaoud et al. (2019) | 0.20 | 0.57 | 1.14 | 1.90 | 2.91
DRBP Gao et al. (2023) | 0.41 | 0.79 | 1.11 | 1.40 | -
EA-Net Cai et al. (2021) | 0.15 | 0.26 | 0.43 | 0.78 | 1.32
CF-LSTM Xie et al. (2021) | 0.18 | 0.42 | 1.07 | 1.72 | 2.44
STDAN Chen et al. (2022) | 0.19 | 0.27 | 0.48 | 0.91 | 1.66
GaVa Liao et al. (2024e) | 0.17 | 0.24 | 0.42 | 0.86 | 1.31
Our model | 0.13 | 0.21 | 0.32 | 0.38 | 1.05

Table 3: RMSE (in meters) on the MoCAD dataset over prediction horizons of 1-5 s.

Model | 1 s | 2 s | 3 s | 4 s | 5 s
S-LSTM Alahi et al. (2016) | 1.73 | 2.46 | 3.39 | 4.01 | 4.93
S-GAN Gupta et al. (2018) | 1.69 | 2.25 | 3.30 | 3.89 | 4.69
CS-LSTM Deo and Trivedi (2018a) | 1.45 | 1.98 | 2.94 | 3.56 | 4.49
MHA-LSTM Messaoud et al. (2021) | 1.25 | 1.48 | 2.57 | 3.22 | 4.20
NLS-LSTM Messaoud et al. (2019) | 0.96 | 1.27 | 2.08 | 2.86 | 3.93
WSiP Wang et al. (2023) | 0.70 | 0.87 | 1.70 | 2.56 | 3.47
CF-LSTM Xie et al. (2021) | 0.72 | 0.91 | 1.73 | 2.59 | 3.44
STDAN Chen et al. (2022) | 0.62 | 0.85 | 1.62 | 2.51 | 3.32
HLTP Liao et al. (2024a) | 0.55 | 0.76 | 1.44 | 2.39 | 3.21
Our model | 0.39 | 0.82 | 1.43 | 2.08 | 2.74

5.3 Comparison to the State of the Art

Our model's performance is evaluated against more than 15 state-of-the-art (SOTA) methods on each referenced dataset. The experimental results, presented in Table 1, show that our model provides significant improvements in trajectory prediction over the prevailing SOTA baselines. Using Root Mean Square Error (RMSE) as the evaluation metric, our model consistently outperforms most baselines on NGSIM, achieving improvements of 29% and 22% over WSiP and STDAN, respectively, over a 5-second horizon. On the HighD dataset, our model consistently outperforms the current SOTA baselines, with average improvements ranging from 43%-70% for short-term forecasts (1-3 seconds) and 62%-78% for long-term forecasts (4-5 seconds). These improvements highlight the importance of integrating spatio-temporal and confidence features. It is noteworthy that the prediction errors on the HighD dataset are significantly lower than those on the NGSIM dataset for all algorithms, probably because HighD provides more accurate trajectory data, including detailed information on position and speed; in addition, the HighD dataset contains approximately twelve times more samples than the NGSIM dataset. Furthermore, on the MoCAD dataset, our model excels on busy urban roads, outperforming the SOTA baselines by at least 37% for short-term predictions and reducing long-term prediction errors by at least 0.58 metres. These improvements highlight the importance of incorporating the characterized diffusion and spatial-temporal interaction networks. In conclusion, our results confirm the effectiveness and efficiency of our model in predicting AV trajectories.

5.4 Ablation Study

Table 4 analyzes five critical components: the characterized diffusion, temporal initializer, spatial interaction, temporal deepening, and confidence feature fusion modules. We test six models, labeled Model A through Model F. Evaluations on the NGSIM dataset reveal that the stripped-down versions (Models A-E) consistently underperform Model F, which includes all components. Importantly, the integration of the characterized diffusion and confidence feature fusion modules significantly enhances performance, underscoring their vital role in improving prediction accuracy. The inclusion of the confidence feature fusion module, which merges spatial-temporal confidence features with maneuver states, enables more precise predictions of the target agent's trajectory, particularly through the incorporation of characterized diffusion.

Table 4: Ablation study on the NGSIM dataset (RMSE at the 5 s horizon). Models A-E are ablated variants built from subsets of the components below; Model F includes all components.

Components: Characterized Diffusion, Temporal Encoder, Spatial Encoder, ST Fusion, Decoder

Model | A | B | C | D | E | F
RMSE | 3.09 | 3.05 | 2.97 | 3.02 | 3.16 | 2.85

[Figure 4]

5.5 Qualitative Results

To gain qualitative insight, we visualize the prediction results to assess the effectiveness of our proposed model. Fig. 4 displays the qualitative results of our model on the NGSIM dataset. As can be seen, the predictions favorably cover the ground-truth trajectories despite the different numbers of neighboring agents in different scenes. These results demonstrate the impressive feature representation capabilities of the CDSTraj model.

6 Conclusion

The intricate dynamics and uncertainties inherent in multi-agent interactions and scene contexts present a significant challenge in the development of fully autonomous vehicles. To address this challenge, we introduce a novel generative model that employs a dual architecture integrating a characterized diffusion mechanism and a spatial-temporal interaction network. Empirical evaluations on the NGSIM, HighD, and MoCAD datasets demonstrate that our model, CDSTraj, consistently outperforms existing SOTA baselines in prediction accuracy over both short and long terms. Future research will explore the application of this model to pedestrian trajectory prediction and investigate the integration of spatial and temporal information. These endeavours may yield significant advancements in AD technologies.

Acknowledgements

This research is supported by the Science and Technology Development Fund of Macau SAR (File no. 0021/2022/ITP, 0081/2022/A2, 001/2024/SKL), and University of Macau (SRG2023-00037-IOTSC).

References

  • Alahi et al. [2016] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE CVPR, pages 961–971, 2016.
  • Altché and de La Fortelle [2017] Florent Altché and Arnaud de La Fortelle. An lstm network for highway trajectory prediction. In IEEE 20th ITSC, pages 353–359. IEEE, 2017.
  • Cai et al. [2021] Yingfeng Cai, Zihao Wang, Hai Wang, Long Chen, Yicheng Li, Miguel Angel Sotelo, and Zhixiong Li. Environment-attention network for vehicle trajectory prediction. IEEE Transactions on Vehicular Technology, 70(11):11216–11227, 2021.
  • Chen et al. [2022] Xiaobo Chen, Huanjia Zhang, Feng Zhao, Yu Hu, Chenkai Tan, and Jian Yang. Intention-aware vehicle trajectory prediction based on spatial-temporal dynamic attention network for internet of vehicles. IEEE Transactions on Intelligent Transportation Systems, 23(10):19471–19483, 2022.
  • Cong et al. [2023] Peichao Cong, Yixuan Xiao, Xianquan Wan, Murong Deng, Jiaxing Li, and Xin Zhang. Dacr-amtp: Adaptive multi-modal vehicle trajectory prediction for dynamic drivable areas based on collision risk. IEEE Transactions on Intelligent Vehicles, 2023.
  • Deo and Trivedi [2018a] Nachiket Deo and Mohan M. Trivedi. Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE CVPR Workshops, pages 1468–1476, 2018.
  • Deo and Trivedi [2018b] Nachiket Deo and Mohan M. Trivedi. Multi-modal trajectory prediction of surrounding vehicles with maneuver based lstms. In IEEE IV, pages 1179–1184. IEEE, 2018.
  • Gao et al. [2023] Kai Gao, Xunhao Li, Bin Chen, Lin Hu, Jian Liu, Ronghua Du, and Yongfu Li. Dual transformer based prediction for lane change intentions and trajectories in mixed traffic environment. IEEE Transactions on Intelligent Transportation Systems, 2023.
  • Gupta et al. [2018] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE CVPR, pages 2255–2264, 2018.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Ho et al. [2022] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • Hu [2020] Dichao Hu. An introductory survey on attention mechanisms in nlp problems. In Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 2, pages 432–448. Springer, 2020.
  • Huang et al. [2022] Yanjun Huang, Jiatong Du, Ziru Yang, Zewei Zhou, Lin Zhang, and Hong Chen. A survey on trajectory-prediction methods for autonomous driving. IEEE Transactions on Intelligent Vehicles, 7(3):652–674, 2022.
  • Kim et al. [2017] ByeoungDo Kim, ChangMook Kang, Jaekyum Kim, SeungHi Lee, ChungChoo Chung, and JunWon Choi. Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In IEEE ITSC, pages 399–404. IEEE, 2017.
  • Kim et al. [2020] Hayoung Kim, Dongchan Kim, Gihoon Kim, Jeongmin Cho, and Kunsoo Huh. Multi-head attention based probabilistic vehicle trajectory prediction. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 1720–1725. IEEE, 2020.
  • Krajewski et al. [2018] Robert Krajewski, Julian Bock, Laurent Kloeker, and Lutz Eckstein. The highd dataset: A drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems. In 21st ITSC, pages 2118–2125. IEEE, 2018.
  • Kuefler et al. [2017] Alex Kuefler, Jeremy Morton, Tim Wheeler, and Mykel Kochenderfer. Imitating driver behavior with generative adversarial networks. In IEEE IV, pages 204–211. IEEE, 2017.
  • Lee et al. [2017] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE CVPR, pages 336–345, 2017.
  • Lefkopoulos et al. [2020] Vasileios Lefkopoulos, Marcel Menner, Alexander Domahidi, and Melanie N. Zeilinger. Interaction-aware motion prediction for autonomous driving: A multiple model kalman filtering scheme. IEEE Robotics and Automation Letters, 6(1):80–87, 2020.
  • Liao et al. [2024a] Haicheng Liao, Yongkang Li, Zhenning Li, Chengyue Wang, Zhiyong Cui, Shengbo Eben Li, and Chengzhong Xu. A cognitive-based trajectory prediction approach for autonomous driving. IEEE Transactions on Intelligent Vehicles, pages 1–12, 2024.
  • Liao et al. [2024b] Haicheng Liao, Zhenning Li, Huanming Shen, Wenxuan Zeng, Dongping Liao, Guofa Li, and Chengzhong Xu. Bat: Behavior-aware human-like trajectory prediction for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10332–10340, 2024.
  • Liao et al. [2024c] Haicheng Liao, Zhenning Li, Chengyue Wang, Huanming Shen, Bonan Wang, Dongping Liao, Guofa Li, and Chengzhong Xu. Mftraj: Map-free, behavior-driven trajectory prediction for autonomous driving, 2024.
  • Liao et al. [2024d] Haicheng Liao, Zhenning Li, Chengyue Wang, Bonan Wang, Hanlin Kong, Yanchen Guan, Guofa Li, Zhiyong Cui, and Chengzhong Xu. A cognitive-driven trajectory prediction model for autonomous driving in mixed autonomy environment. arXiv preprint arXiv:2404.17520, 2024.
  • Liao et al. [2024e]Haicheng Liao, Shangqian Liu, Yongkang Li, Zhenning Li, Chengyue Wang, Bonan Wang, Yanchen Guan, and Chengzhong Xu.Human observation-inspired trajectory prediction for autonomous driving in mixed-autonomy traffic environments.arXiv preprint arXiv:2402.04318, 2024.
  • Liao et al. [2024f]Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, and Chengzhong Xu.Gpt-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models.Communications in Transportation Research, 4:100116, 2024.
  • Messaoud et al. [2019]Kaouther Messaoud, Itheri Yahiaoui, Anne Verroust-Blondet, and Fawzi Nashashibi.Non-local social pooling for vehicle trajectory prediction.In IEEE IV, pages 975–980. IEEE, 2019.
  • Messaoud et al. [2021]Kaouther Messaoud, Itheri Yahiaoui, Anne Verroust-Blondet, and Fawzi Nashashibi.Attention based vehicle trajectory prediction.IEEE Transactions on Intelligent Vehicles, 6(1):175–185, 2021.
  • Pan [2020]Yunhe Pan.Multiple knowledge representation of artificial intelligence.Engineering, 6(3):216–217, 2020.
  • Poole et al. [2022]Ben Poole, Ajay Jain, JonathanT Barron, and Ben Mildenhall.Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022.
  • Prevost et al. [2007]CaroleG Prevost, Andre Desbiens, and Eric Gagnon.Extended kalman filter for state estimation and trajectory prediction of a moving object detected by an unmanned aerial vehicle.In 2007 American control conference, pages 1805–1810. IEEE, 2007.
  • Ramesh et al. [2022]Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Rombach et al. [2022]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.In Proceedings of the IEEE/CVF CVPR, pages 10684–10695, 2022.
  • Tang and Salakhutdinov [2019]Charlie Tang and RussR Salakhutdinov.Multiple futures prediction.Advances in neural information processing systems, 32, 2019.
  • Wang et al. [2022]YuWang, Shengjie Zhao, Rongqing Zhang, Xiang Cheng, and Liuqing Yang.Multi-vehicle collaborative learning for trajectory prediction with spatio-temporal tensor fusion.IEEE Transactions on Intelligent Transportation Systems, 23(1):236–248, 2022.
  • Wang et al. [2023]Renzhi Wang, Senzhang Wang, Hao Yan, and Xiang Wang.Wsip: Wave superposition inspired pooling for dynamic interactions-aware trajectory prediction.In Proceedings of the AAAI Conference on Artificial Intelligence, volume37, pages 4685–4692, 2023.
  • Xie et al. [2021]XuXie, Chi Zhang, Yixin Zhu, YingNian Wu, and Song-Chun Zhu.Congestion-aware multi-agent trajectory prediction for collision avoidance.In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13693–13700. IEEE, 2021.
  • Zhao et al. [2019]Tianyang Zhao, Yifei Xu, Mathew Monfort, Wongun Choi, Chris Baker, Yibiao Zhao, Yizhou Wang, and YingNian Wu.Multi-agent tensor fusion for contextual trajectory prediction.In Proceedings of the IEEE/CVF CVPR, pages 12126–12134, 2019.
  • Zhou et al. [2021]Hao Zhou, Dongchun Ren, Huaxia Xia, Mingyu Fan, XuYang, and Hai Huang.Ast-gnn: An attention-based spatio-temporal graph neural network for interaction-aware pedestrian trajectory prediction.Neurocomputing, 445:298–308, 2021.
  • Zuo et al. [2023]Zhiqiang Zuo, Xinyu Wang, Songlin Guo, Zhengxuan Liu, Zheng Li, and Yijing Wang.Trajectory prediction network of autonomous vehicles with fusion of historical interactive features.IEEE Transactions on Intelligent Vehicles, 2023.