Paper Reading: Pedestrian Crossing Action Recognition and Trajectory Prediction with 3D Human Keypoints
Published:
1. Introduction
This paper presents a multi-task learning framework for pedestrian crossing action recognition and trajectory prediction, which utilizes 3D human keypoints extracted from raw sensor data to capture rich information on human pose and activity. Apply two auxiliary tasks and contrastive learning to enable auxiliary supervisions to improve the learned keypoints representation, which further enhances the performance of major tasks.
2. Releated work
Crossing Action Recognition: A binary classification problem.
Trajectory Prediction: Many existing methods utilize the high-level history state information (e.g., position, velocity) and the context information (e.g., roadgraph/map, context agent trajectory) to forecast future state sequences. There are two widely used ways to represent the roadgraph information: (a) rasterized top-down view images; and (b) roadgraph vectors.
Human Keypoints Encoding: As the development of graph representation learning, graph neural networks are adopted to extract spatio-temporal feature representations from a spatio-temporal graph constructed by the human keypoints sequence, where the nodes represent human joints and edges represent the bones between joints. In this work, we employ the Spatio-Temporal Graph Convolutional Network proposed in [47] as the backbone of our keypoints encoder.
3. Problem Formulation
4. Method
A. Model Overview
The encoder
- A context encoding channel EC: Generates the embedding of roadgraph and contect agents.
- A keypoints encoding channel EK: Extracts the embedding of human pose and activity.
- A keypoints encoding channel ET: Extracts target pedestrian track features.
The decoder
- Two individual heads fro major tasks
- crossing action recognition
- trajectory prediction
- Three auxiliary heads to enable auxiliary supervision for keypoints representation learning.
B. Encoder
Context ENcoder: We adopt the vectorized representation proposed in [32] to represent the roadgraph and the historical trajectories of context traffic participants which may have interactions with the target pedestrian. Specifically, the roadgraph (i.e., lanes, traffic signs) and trajectories are transformed into polylines with a variable number of vectors respectively.
Keypoints Encoder: Since the human skeleton can be naturally represented as a graph where nodes are human joints and edges are bones, we construct a spatio-temporal graph to represent a keypoints sequence where the same joint nodes at consecutive frames are linked by temporal edges. We adopt the spatial configuration partitioning strategy [47] to obtain the adjacency matrix at each frame. We employ the Spatio-Temporal Graph Convolutional Network with the same layer architecture in [47] to extract both spatial and temporal patterns at different scales. A global average pooling layer is applied after graph convolutional layers to generate the final keypoints embedding.
Target Track Encoder: We use a recurrent neural network as the target track encoder ET to encode the history trajectory of the target pedestrian. The extracted embedding is concatenated with the keypoints embedding to obtain the target node embedding.
Global Interaction: The polyline subgraphs and target node are used to construct a fully connected global interaction graph and message passing is applied to model the agent-agent and agent-road interactions between scene elements. We implement this global interaction graph as a selfattention layer, as in [32]. The output of the global interaction module is a global context embedding for each modeled agent while we only use that of the target pedestrian.
C. Decoder
Crossing Action Recognition Head: The crossing action recognition head is a fully-connected layer whose output are the probabilities of crossing/non-crossing actions. We adopt a standard binary cross-entropy loss LAR for training.
Trajectory Prediction Head: We adopt the TargetdriveN Trajectory prediction method proposed in [5], which consists of three stages: (a) target prediction; (b) targetconditioned trajectory generation; and (c) trajectory scoring.
D. Auxiliary Supervisions
- Keypoints Jigsaw Puzzle (KJP)
- Keypoints Prediction (KP)
- Keypoints Contrastive Learning (KCL)
E. Loss Functions and Training
Since the proposed framework deals with multiple tasks simultaneously, the complete loss function is a weighted sum of all the loss components brought by each head.
The key takeaway from this study is by implementing keypoints jigsaw puzzle, keypoints prediction, and keypoints contrastive learning as auxiliary supervisions, we observed a marked improvement in both accuracy and computational efficiency. This might be a good approach for person reid tasks.