Symbiotic Graph Neural Networks
for 3D Skeleton-based Human Action Recognition and Motion Prediction
Abstract
3D skeleton-based action recognition and motion prediction are two essential problems of human activity understanding. Many previous works 1) studied the two tasks separately, neglecting their internal correlations; and 2) did not capture sufficient relations inside the body. To address these issues, we propose a symbiotic model that handles the two tasks jointly, and we propose two scales of graphs to explicitly capture relations among body-joints and body-parts. Together, we propose symbiotic graph neural networks, which contain a backbone, an action-recognition head, and a motion-prediction head. The two heads are trained jointly and enhance each other. For the backbone, we propose multi-branch multi-scale graph convolution networks to extract spatial and temporal features. The multi-scale graph convolution networks are based on joint-scale and part-scale graphs. The joint-scale graphs contain actional graphs, capturing action-based relations, and structural graphs, capturing physical constraints. The part-scale graphs integrate body-joints to form specific parts, representing high-level relations. Moreover, dual bone-based graphs and networks are proposed to learn complementary features. We conduct extensive experiments for skeleton-based action recognition and motion prediction on four datasets: NTU-RGB+D, Kinetics, Human3.6M, and CMU Mocap. Experiments show that our symbiotic graph neural networks achieve better performance on both tasks than the state-of-the-art methods. The code is released at github.com/limaosen0/SymGNN
1 Introduction
Human action recognition and motion prediction are crucial problems in computer vision, with wide applications in surveillance [2], pedestrian tracking [3], and human-machine interaction [4]. Action recognition aims to accurately classify the categories of query actions [5], and motion prediction forecasts future movements based on observations [6].
Action data can be represented in various formats, including RGB videos [7] and 3D skeleton data [8]. Notably, 3D skeleton data, which locates 3D body-joints, has been shown to be effective in action representation, efficient in computation, and robust against environmental noise [9]. In this work, we focus on action recognition and motion prediction based on 3D skeleton data.
In most previous studies, 3D skeleton-based action recognition and motion prediction are treated separately, as the former needs to discriminate classes while the latter generates future poses. For action recognition, methods employed full action sequences for pattern learning [10, 11, 12, 13, 14]; however, with long-term inputs, these methods fail in some real-time applications because of delayed discrimination, whereas the model should respond as early as possible. As for motion prediction, previous works built generative models [15, 16, 17, 18]; they learned motion dynamics, but often ignored semantics. In fact, action recognition and motion prediction can promote each other, which previous works rarely explored. For example, the classifier provides action categories as auxiliary information to guide prediction, and the predictor preserves more detailed information for accurate recognition via self-supervision. To exploit these mutual promotions, we aim to develop a symbiotic method that performs action recognition and motion prediction simultaneously.
For both 3D skeleton-based action recognition and motion prediction, the key is to effectively capture the motion patterns of various actions. Many efforts have been made in this direction. Traditional attempts often vectorized all the joints into a pose vector and built hand-crafted models for feature learning [19, 20, 10, 15, 21, 22, 23]. Recently, deep models based on either convolutional neural networks (CNN) or recurrent neural networks (RNN) learned high-level features from data [11, 24, 12, 13, 16, 17, 25, 18, 26, 27]; however, these methods rarely investigated the joint relations, missing crucial activity dynamics. To capture richer features, several works exploited joint relations from various aspects. [14] proposed skeleton-graphs with nodes as joints and edges as bones. [28, 29, 16] built the relations between body-parts, such as limbs. [30] merged individual part features. [31] leveraged spatial convolutions on pose vectors, but its output depends on the joint permutation. These works aggregated information from local or coarse neighborhoods. Notably, some relations may exist among action-related joints, such as hands and feet moving collaboratively during walking. Moreover, some motion prediction methods fed ground-truth action categories to enhance performance in both training and testing phases, but the true labels are hard to obtain in real-world scenarios. To solve these issues, we construct graphs to model both local and long-range body relations and use graph convolutions to capture informative spatial features.
In this paper, we propose a novel model called symbiotic graph neural network (SymGNN), which handles 3D skeleton-based action recognition and motion prediction simultaneously and uses graph-based operations to capture spatial features. As basic operators of SymGNN, we propose the joint-scale graph convolution (JGC) and part-scale graph convolution (PGC) operators to extract multi-scale spatial information. JGC is based on two types of graphs: actional graphs and structural graphs. The actional graphs are learned from 3D skeleton data by an actional graph inference module (AGIM), capturing action-based relations; the structural graphs are built by extending the skeleton graphs, capturing physical constraints. PGC is based on a part-scale graph, whose nodes are integrated body-part features and whose edges are based on body-part connections. We also propose a difference operator to extract multiple orders of motion differences, reflecting positions, velocities, and accelerations of body-joints.
The proposed SymGNN consists of a backbone, called multi-branch multi-scale graph convolutional network (multi-branch multi-scale GCN), an action-recognition head, and a motion-prediction head; see Fig. 1. The backbone uses joint-scale and part-scale graphs for spatial relation representation and high-level feature extraction; the two heads work on the two separate tasks. Moreover, the tasks promote each other: the action-recognition head determines action categories, which are used to enhance prediction performance; meanwhile, the motion-prediction head predicts poses and improves recognition by promoting self-supervision and preserving detailed features. The model is trained through a multi-tasking paradigm. Additionally, we build a dual bone-based network which treats bones as graph nodes and learns bone features to obtain complementary semantics for more effective classification and prediction.
To validate SymGNN, we conduct extensive experiments on four large-scale datasets: NTU-RGB+D [32], Kinetics [14], Human3.6M [33], and CMU Mocap (http://mocap.cs.cmu.edu/). The results show that 1) SymGNN outperforms the state-of-the-art methods in both action recognition and motion prediction; 2) using the symbiotic model to train the two tasks simultaneously produces better performance than using individual models; and 3) the multi-scale graphs model complicated relations between body-joints and body-parts, and the proposed JGC extracts informative spatial features.
Overall, the main contributions in this paper are summarized as follows:

Multi-tasking framework. We propose novel symbiotic graph neural networks (SymGNN) to achieve 3D skeleton-based action recognition and motion prediction in a multi-tasking framework. SymGNN contains a backbone, an action-recognition head, and a motion-prediction head. We exploit the mutual promotion between the two heads, leading to improvements in both tasks; see Section 5;

Basic operators. We propose novel operators to extract information from 3D skeleton data: 1) a joint-scale graph convolution operator is proposed to extract joint-level spatial features based on both actional and structural graphs; see Section 4.1.3; 2) a part-scale graph convolution operator is proposed to extract part-level spatial features based on part-scale graphs; see Section 4.2.1; 3) a pair of bidirectional fusion operators is proposed to fuse information across two scales; see Section 5.1.2; and 4) a difference operator is proposed to extract temporal features; see Section 4.3; and

Experimental findings. We conduct extensive experiments for both tasks of 3D skeleton-based action recognition and motion prediction. The results show that SymGNN outperforms the state-of-the-art methods in both tasks; see Section 6.
2 Related Works
3D skeleton-based action recognition. Numerous methods have been proposed for 3D skeleton-based action recognition. Conventionally, some models learned semantics based on hand-crafted features and physical intuitions [19, 20, 10]. In the deep learning era, models automatically learn features from data. Some recurrent-neural-network-based (RNN-based) models captured the temporal dependencies between consecutive frames [12, 34]. Moreover, convolutional neural networks (CNN) also achieve remarkable results [24, 13]. Recently, graph-based approaches have drawn much attention [14, 28, 29, 35, 36, 1, 37, 38]. In this work, we adopt the graph-based approach. We construct multi-scale graphs adaptively from data, capturing useful and comprehensive information about actions.
3D skeleton-based motion prediction. In earlier studies, state models were considered to predict future motions [15, 21, 39]. Recently, deep learning techniques play increasingly important roles. Some RNN-based methods learned the dynamics from sequences [16, 17, 40, 41, 42, 27]. Moreover, adversarial mechanisms and geodesic loss could further improve predictions [18]. In our method, we use graph structures to explicitly model the relations between body-joints and body-parts, guiding the networks to learn local and non-local motion patterns.
Graph deep learning. Graphs, the focus of many recent studies, are effective for expressing data associated with non-grid structures [14, 43, 44, 45, 46, 47, 48, 28]. Given fixed topologies, previous works explored propagating node features in the spectral domain [46, 47] or the vertex domain [48]. [14, 1, 29, 36, 35] leveraged graph convolution for 3D skeleton-based action recognition. [16] also considered the skeleton-based relations for motion prediction. In this paper, we propose multi-scale graphs to represent multiple relations: joint-scale and part-scale relations. Then, we propose novel graph convolution operators to extract deep features for action recognition and motion prediction. Unlike [1], which obtains multiple actional graphs with complicated inference processes, our method employs more efficient graph learning operations.
3 Problem Formulation
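In this paper, we study 3D skeleton-based action recognition and motion prediction jointly. We use the 3D joint positions along time to represent the action sequences. Mathematically, let the action pose at time stamp $t$ be $X^{(t)} \in \mathbb{R}^{M \times 3}$, where $t > 0$ indicates the future frames and $t \le 0$ the observed frames; notably, $t = 0$ denotes the current frame. $M$ is the number of joints and $3$ reflects the 3D joint positions. The action pose is essentially associated with a skeleton graph, which represents the pairwise bone connectivity. We can represent a skeleton graph by a binary adjacency matrix; that is, $A \in \{0,1\}^{M \times M}$, where the $(i,j)$th element $A_{ij} = 1$ when the $i$th and the $j$th body-joints are connected with bones, and $A_{ij} = 0$ otherwise. Note that $A$ includes self-loops.
For an action sequence belonging to one class, we have $(X_{\rm prev}, X_{\rm pred}, y)$, where $X_{\rm prev} = [X^{(-T_{\rm prev}+1)}, \dots, X^{(0)}] \in \mathbb{R}^{T_{\rm prev} \times M \times 3}$ denotes the previous motion tensor; $X_{\rm pred} = [X^{(1)}, \dots, X^{(T_{\rm pred})}] \in \mathbb{R}^{T_{\rm pred} \times M \times 3}$ denotes the future motion tensor; $T_{\rm prev}$ and $T_{\rm pred}$ are the frame numbers of previous and future motions, respectively; and the one-hot vector $y \in \{0,1\}^{C}$ denotes the class-label in $C$ possible classes. Let $\mathcal{F}(\cdot)$ be the overall model. The discriminated class category $\hat{y}$ and the predicted motion $\hat{X}_{\rm pred}$ are formulated as
$$(\hat{y},\ \hat{X}_{\rm pred}) = \mathcal{F}(X_{\rm prev};\ \theta_{\rm b}, \theta_{\rm r}, \theta_{\rm p}),$$
where $\theta_{\rm b}$, $\theta_{\rm r}$ and $\theta_{\rm p}$ denote trainable parameters of the backbone, the action-recognition head and the motion-prediction head, respectively.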
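As a small illustration of the skeleton graph described above, the following sketch builds a binary adjacency matrix with self-loops from a list of bones. The helper name `skeleton_adjacency` and the 5-joint toy skeleton are hypothetical and only serve the example; real datasets define their own joint indices and bone lists.

```python
import numpy as np

def skeleton_adjacency(num_joints, bones):
    """Binary adjacency matrix of a skeleton graph, including self-loops."""
    A = np.eye(num_joints, dtype=np.int64)   # self-loops on the diagonal
    for i, j in bones:
        A[i, j] = 1                          # bones are undirected,
        A[j, i] = 1                          # so the matrix is symmetric
    return A

# Hypothetical 5-joint toy skeleton: a chain 0-1-2-3 plus a branch 2-4.
TOY_BONES = [(0, 1), (1, 2), (2, 3), (2, 4)]
A = skeleton_adjacency(5, TOY_BONES)
print(A)
```

Each row of the result lists the joint itself and its bone-connected neighbors, which is the 1-hop neighborhood used later by the structural graphs.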
4 Basic Components
In this section, we propose some novel components of our model. We first propose joint-scale graph operators, extracting features among body-joints; we next propose part-scale graph operators, extracting features among body-parts; finally, we propose a difference operator to provide richer motion priors.
4.1 Joint-Scale Graph Operators
To model the joint relations, we build joint-scale graphs including actional graphs, capturing moving interactions between joints even without bone-connection, and structural graphs, extending the skeleton structures to represent physical constraints. Fig. 2 sketches some examples. Plot (a) shows a skeleton graph with local neighborhood; plot (b) shows an actional graph, which captures action-based dependencies, e.g. ‘Left Hand’ is linked with ‘Right Hand’ and feet during walking; plot (c) shows a structural graph, which allows ‘Left Hand’ to link with the entire arm.
In what follows, we present the construction of the joint-scale actional and structural graphs. We then present the joint-scale graph and temporal convolution (JGTC) block to learn spatial and temporal features of sequential actions.
4.1.1 Actional Graph Convolution
For different movements, some structurally distant joints may interact, leading to action-based relations. For example, a walking person moves hands and feet collaboratively. To represent actional relations, we employ an actional graph: $\mathcal{G}_{\rm act}(\mathcal{V}, A_{\rm act})$, where $\mathcal{V}$ is the joint set and $A_{\rm act} \in \mathbb{R}^{M \times M}$ is the adjacency matrix that reveals the pairwise joint-scale actional relations. To obtain this topology, we propose a data-adaptive module, called actional graph inference module (AGIM), to learn purely from observations without knowing action categories.
To utilize the body dynamics, we let the vector representation of the $i$th joint positions across all observed frames be $x_i \in \mathbb{R}^{3T_{\rm prev}}$, which stacks the positions from the $T_{\rm prev}$ observed frames. To learn all relations, we propagate pose information between body-joints and possible edges. We first initialize $p_i^{(0)} = f_v^{(0)}(x_i)$, where $f_v^{(0)}(\cdot)$ is a multi-layer perceptron (MLP) that maps the raw joint moving data $x_i$ to joint features $p_i^{(0)}$. In the $k$th iteration, the features are propagated as follows:
$$q_{ij}^{(k)} = f_e^{(k)}\big([\,p_i^{(k-1)},\ p_j^{(k-1)}\,]\big) \in \mathbb{R}^{D_e}, \qquad \text{(1a)}$$
$$p_i^{(k)} = f_v^{(k)}\Big(\sum_{j=1}^{M} q_{ij}^{(k)}\Big) \in \mathbb{R}^{D_v}, \qquad \text{(1b)}$$
where $p_i^{(k)}$, $q_{ij}^{(k)}$ are the feature vectors of the $i$th joint and the edge connecting the $i$th and $j$th joints at the $k$th iteration; $f_e^{(k)}(\cdot)$ and $f_v^{(k)}(\cdot)$ are two MLP-formed feature extractors; $[\cdot,\cdot]$ is the concatenation; $D_e$ and $D_v$ denote the dimensions of edge and joint features, respectively. (1a) maps a pair of joint features to the in-between edge features; (1b) aggregates all edge features associated with the same joint and maps them to the corresponding joint features. After $K$ iterations, information is fully propagated between joints and edges; in other words, each joint feature obtained has aggregated information over a long range.
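Given any joint feature after $K$ iterations, $p_i^{(K)}$, we compute the relation strength between each pair of joints, leading to an actional graph. We build two individual embedding networks, $f_{\rm emb}(\cdot)$ and $g_{\rm emb}(\cdot)$, to further learn the high-level representations of joints. The $(i,j)$th element of the adjacency matrix of the actional graph is formulated as
$$(A_{\rm act})_{ij} = \frac{\exp\big(f_{\rm emb}(p_i^{(K)})^{\top} g_{\rm emb}(p_j^{(K)})\big)}{\sum_{m=1}^{M} \exp\big(f_{\rm emb}(p_i^{(K)})^{\top} g_{\rm emb}(p_m^{(K)})\big)}, \qquad \text{(2)}$$
where $f_{\rm emb}(p_i^{(K)})$ and $g_{\rm emb}(p_i^{(K)})$ are the two different embeddings of joint $i$. Notably, $(A_{\rm act})_{ij} \neq (A_{\rm act})_{ji}$ in general, indicating incoming and outgoing relations between joints. (2) uses the softmax to normalize the edge weights and promote a few large ones.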
The structure of AGIM is illustrated in Fig. 3, where the information is propagated for $K$ iterations between joints and all possible edges, and two embedded joint features are used to calculate the actional graph in the end.
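The inference procedure above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that are not taken from the paper: each MLP is replaced by a single random linear layer with a tanh nonlinearity, edge features are aggregated by summation, and the names `init_params` and `infer_actional_graph` are hypothetical.

```python
import numpy as np

def init_params(d_x, d_v, d_e, num_iters, rng):
    """Random single-layer stand-ins for the MLPs (an assumption of this sketch)."""
    g = lambda a, b: rng.normal(scale=0.1, size=(a, b))
    return {
        "init": g(d_x, d_v),
        "edge": [g(2 * d_v, d_e) for _ in range(num_iters)],
        "node": [g(d_e, d_v) for _ in range(num_iters)],
        "emb_out": g(d_v, d_v),
        "emb_in": g(d_v, d_v),
    }

def infer_actional_graph(x, params, num_iters=2):
    """x: (M, D_x) flattened joint trajectories. Returns a row-stochastic (M, M) matrix."""
    M = x.shape[0]
    p = np.tanh(x @ params["init"])                    # initial joint features
    for k in range(num_iters):
        # joint -> edge: concatenate the features of every ordered pair (i, j)
        pairs = np.concatenate(
            [np.repeat(p[:, None, :], M, axis=1),      # p_i repeated over j
             np.repeat(p[None, :, :], M, axis=0)],     # p_j repeated over i
            axis=-1)                                   # (M, M, 2 * D_v)
        q = np.tanh(pairs @ params["edge"][k])         # edge features (M, M, D_e)
        # edge -> joint: aggregate all edges attached to joint i
        p = np.tanh(q.sum(axis=1) @ params["node"][k]) # (M, D_v)
    # two different embeddings give asymmetric pairwise scores
    scores = (p @ params["emb_out"]) @ (p @ params["emb_in"]).T
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)            # softmax over each row

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 30))                           # 5 joints, 10 frames x 3 coords
A_act = infer_actional_graph(x, init_params(30, 16, 16, 2, rng))
print(A_act.sum(axis=1))                               # each row sums to 1
```

Because the softmax is taken over each row, every joint distributes a unit of attention over all joints, including structurally distant ones.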
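Given the joint-scale actional graph with adjacency matrix $A_{\rm act}$, we aggregate the joint information along the action-based relations. We design an actional graph convolution (AGC) to capture the actional features. Mathematically, let the input features at frame $t$ be $X^{(t)} \in \mathbb{R}^{M \times D_{\rm in}}$ and the output features be $X_{\rm act}^{(t)} \in \mathbb{R}^{M \times D_{\rm out}}$; the AGC works as
$$X_{\rm act}^{(t)} = {\rm AGC}(X^{(t)}) = A_{\rm act}\, X^{(t)}\, W_{\rm act}, \qquad \text{(3)}$$
where $W_{\rm act} \in \mathbb{R}^{D_{\rm in} \times D_{\rm out}}$ is a trainable weight. Therefore, the model aggregates the action-based information from collaboratively moving joints, even distant ones.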
4.1.2 Structural Graph Convolution
Intuitively, the joint dynamics are limited by physical constraints, namely bone connections. To capture these relations, we develop a structural graph, $\mathcal{G}_{\rm struct}(\mathcal{V}, A_{\rm struct})$. Let $A$ be the adjacency matrix of the skeleton graph (see Section 3) and the normalized adjacency matrix be
$$\tilde{A} = D^{-1} A,$$
where $D$ is a diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. $\tilde{A}$ provides a good initialization to learn the edge weights and avoids multiplication explosion [49, 50].
We note that $\tilde{A}$ only describes the 1-hop neighborhood on the body; that is, the bone-connected joints. To represent long-range relations, we use the high-order polynomial of $\tilde{A}$. Let the $\gamma$-order polynomial of $\tilde{A}$ be $\tilde{A}^{\gamma}$, which can be directly computed from the skeleton structure; $\tilde{A}^{\gamma}$ indicates the relations between each joint and its $\gamma$-hop neighbors on the skeleton. Given the high-order topologies, we introduce several individual edge-weight matrices $E^{(\gamma)} \in \mathbb{R}^{M \times M}$ corresponding to $\tilde{A}^{\gamma}$, where each element is trainable to reflect the relation strength. We finally obtain the $\gamma$-order structural graph, whose weighted adjacency matrix is
$$A_{\rm struct}^{(\gamma)} = E^{(\gamma)} \circ \tilde{A}^{\gamma},$$
where $\circ$ denotes the element-wise multiplication. In this way, we are able to model the structure-based relations between one joint and others over a relatively longer range. Practically, we consider the orders $\gamma = 1, 2, \dots, \Gamma$, thus we have multiple structural graphs for one body. As plot (c) in Fig. 2 shows, the hand is correlated with the entire arm.
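The construction of the multi-order structural graphs can be illustrated with a short NumPy sketch. It assumes row normalization by the inverse degree matrix and, when no trainable edge weights are supplied, uses all-ones weights as a placeholder; the function name `structural_graphs` is hypothetical.

```python
import numpy as np

def structural_graphs(A, max_order, edge_weights=None):
    """Return one weighted adjacency matrix per hop order 1..max_order."""
    A_norm = np.diag(1.0 / A.sum(axis=1)) @ A       # assumed normalization: D^{-1} A
    graphs, power = [], np.eye(A.shape[0])
    for g in range(max_order):
        power = power @ A_norm                      # (g + 1)-th power of A_norm
        # placeholder for the trainable edge-weight matrix of this order
        W = np.ones_like(A_norm) if edge_weights is None else edge_weights[g]
        graphs.append(W * power)                    # element-wise re-weighting
    return graphs

# 3-joint chain 0-1-2 with self-loops
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
first, second = structural_graphs(A, max_order=2)
print(first[0, 2], second[0, 2])                    # joint 2 is only reachable from joint 0 at order 2
```

In the toy chain, the two end joints are unconnected in the first-order graph but receive a nonzero weight in the second-order one, which is the long-range effect the polynomial is meant to provide.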
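Given the structural graphs with adjacency matrices $A_{\rm struct}^{(\gamma)}$, $\gamma = 1, \dots, \Gamma$, we propose the structural graph convolution (SGC) operator. Let the input features at frame $t$ be $X^{(t)} \in \mathbb{R}^{M \times D_{\rm in}}$ and the output features be $X_{\rm struct}^{(t)} \in \mathbb{R}^{M \times D_{\rm out}}$; the SGC operator is formulated as
$$X_{\rm struct}^{(t)} = {\rm SGC}(X^{(t)}) = \sum_{\gamma=1}^{\Gamma} A_{\rm struct}^{(\gamma)}\, X^{(t)}\, W_{\rm struct}^{(\gamma)}, \qquad \text{(4)}$$
where $W_{\rm struct}^{(\gamma)} \in \mathbb{R}^{D_{\rm in} \times D_{\rm out}}$ are the trainable model parameters. Notably, the multiple structural graphs have different corresponding weights, which helps to extract richer features.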
4.1.3 Joint-Scale Graph Convolution
To extract both spatial and temporal features of actions, we now propose the joint-scale graph and temporal convolution block. Based on AGC (3) and SGC (4), we present the joint-scale graph convolution (JGC) to capture comprehensive joint-scale spatial features. Mathematically, let the input joint features at frame $t$ be $X^{(t)}$ and the output features be $X_{\rm space}^{(t)}$; the JGC is formulated as
$$X_{\rm space}^{(t)} = {\rm JGC}(X^{(t)}) = \lambda_{\rm act}\, {\rm AGC}(X^{(t)}) + {\rm SGC}(X^{(t)}), \qquad \text{(5)}$$
where $\lambda_{\rm act}$ is a hyper-parameter that trades off the contributions of actional and structural features. A nonlinear activation function can be applied to the output. In this way, the joint features are effectively aggregated to update each center joint according to the joint-scale graphs.
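A minimal NumPy sketch of the joint-scale graph convolution follows. It assumes the actional and structural terms are combined as a weighted sum with the trade-off weight placed on the actional term; the function name `jgc` and the default `lam=0.5` are illustrative choices, not values from the paper.

```python
import numpy as np

def jgc(X, A_act, A_structs, W_act, W_structs, lam=0.5):
    """X: (M, D_in) joint features at one frame. Returns (M, D_out) features."""
    actional = A_act @ X @ W_act                        # aggregate along actional links
    structural = sum(A @ X @ W                          # one term per structural order
                     for A, W in zip(A_structs, W_structs))
    return lam * actional + structural                  # assumed form of the trade-off

# Sanity check with identity graphs and weights: output is (lam + 1) * X
X = np.ones((3, 2))
out = jgc(X, np.eye(3), [np.eye(3)], np.eye(2), [np.eye(2)], lam=0.5)
print(out)
```

With identity graphs the operator reduces to a rescaling of the input, which makes the role of the trade-off weight easy to see before real graphs are plugged in.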
We further show the stability of the proposed activated joint-scale graph convolution layer; that is, when the input 3D skeleton data is perturbed, the distortion of the output features is upper bounded.
Theorem
(Stability) Let two jointscale feature matrices be and associated with a skeleton graph , where and (). Let and . Let and be the jointscale actional graph inferred from and , respectively, where
with the amplify factor and some constant. Let , and , where , and . Then,
Note that denotes Frobenius norm and is zero norm. denotes ReLUactivation on each element of the data. denotes the effects that rely on ‘’.
See the proof in the Appendix. Theorem 4.1.3 only shows the joint-scale graph convolution at the first layer, but the bound can be extended to the subsequent layers. We make this distinction because the actional graph only depends on the input data. Theorem 4.1.3 shows that 1) the outputs of JGC followed by the activation function can be upper bounded, reflecting its robustness against perturbation of inputs; and 2) given a fixed model, the bound is mainly related to the amplify factor, reflecting how much the actional graph would amplify the perturbation. In the experiments, we show that the amplify factor is around 1. We also test JGC’s robustness against input perturbation, ensuring that SymGNN has stable performance under small noise.
4.1.4 Joint-Scale Graph and Temporal Convolution Block
While the JGC operator leverages the joint spatial relations and extracts rich features, we should also model the temporal dependencies among consecutive frames. We develop the temporal convolution operator (TC); that is, a convolution along time to learn the movements. Stacking JGC and TC, we build the joint-scale graph and temporal convolution block (JGTC block), which learns the spatial and temporal features in tandem. Mathematically, let $\mathcal{X} \in \mathbb{R}^{T \times M \times D_{\rm in}}$ be an input tensor; each JGTC block works as
$$\mathcal{X}_{\rm s}[t,:,:] = {\rm ReLU}\big({\rm JGC}(\mathcal{X}[t,:,:])\big), \qquad \text{(6a)}$$
$$\mathcal{X}_{\rm out} = {\rm ReLU}\big({\rm TC}(\mathcal{X}_{\rm s};\ \tau, s)\big), \qquad \text{(6b)}$$
where ${\rm ReLU}(\cdot)$ represents a nonlinear ReLU function; ${\rm TC}(\cdot\,; \tau, s)$ is a standard 1D convolution along the time axis, whose temporal kernel size is $\tau$; $s$ is the convolution stride along time to shrink the temporal dimension; $t$ is the time stamp. In each JGTC block, (6a) extracts spatial features using the multiple spatial relations between joints; and (6b) extracts temporal features by aggregating the information in several consecutive frames. Our JGTC block also includes batch normalization and dropout operations. Moreover, there is a residual connection preserving the input features. The architecture of one JGTC block is illustrated in Fig. 4.
By stacking several JGTC blocks in a hierarchy, we gradually convert the motion dynamics from the sample space to the feature space; that is, we capture the high-level semantic information of the input sequence for downstream action recognition and motion prediction.
4.2 Part-Scale Graph Operators
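The two steps of a JGTC block, a per-frame spatial operator followed by a strided 1D convolution along time, can be sketched as below. The sketch omits batch normalization, dropout, padding, and the residual connection, and takes the spatial operator as a generic callable `spatial_fn`; these simplifications and all function names are assumptions of the example.

```python
import numpy as np

def temporal_conv(X, kernel, stride=1):
    """X: (T, M, D_in); kernel: (tau, D_in, D_out). Unpadded 1D convolution along time."""
    tau = kernel.shape[0]
    T_out = (X.shape[0] - tau) // stride + 1
    out = np.zeros((T_out, X.shape[1], kernel.shape[2]))
    for t in range(T_out):
        window = X[t * stride: t * stride + tau]            # (tau, M, D_in)
        out[t] = np.einsum("kmd,kde->me", window, kernel)   # mix frames and channels
    return out

def jgtc_block(X, spatial_fn, kernel, stride=1):
    """Spatial step on every frame, then temporal step, each followed by ReLU."""
    relu = lambda z: np.maximum(z, 0.0)
    X_s = relu(np.stack([spatial_fn(X[t]) for t in range(X.shape[0])]))
    return relu(temporal_conv(X_s, kernel, stride))

# 8 frames, 3 joints, 2 channels; identity spatial step; kernel size 3, stride 2
X = np.ones((8, 3, 2))
out = jgtc_block(X, lambda frame: frame, np.ones((3, 2, 2)), stride=2)
print(out.shape)   # the temporal dimension shrinks from 8 to 3
```

A stride larger than one is what shrinks the temporal dimension from block to block.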
The joint-scale graphs treat body-joints as nodes and model their relations, but some action patterns depend on more abstract movements of body-parts. For example, ‘hand waving’ shows a rising arm, but the finger and wrist are less important. To model the part dynamics, we propose a part-scale graph and the part-scale graph and temporal convolution (PGTC) block to extract part-scale features.
4.2.1 Part-Scale Graph Convolution
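For a part-scale graph, we define body-parts as graph nodes: ‘head’, ‘torso’, and pairs of ‘upper arms’, ‘forearms’, ‘thighs’ and ‘crura’, each of which integrates the joints it covers on the joint-scale body. We build the edges according to body nature. The right plot of Fig. 5 shows an example of a part-scale graph, whose vertices are parts and whose edges are based on body nature. The self-looped binary adjacency matrix of the part-scale graph is $A_{\rm part} \in \{0,1\}^{M_{\rm p} \times M_{\rm p}}$, where $M_{\rm p}$ is the number of parts and $(A_{\rm part})_{ij} = 1$ if the $i$th and $j$th parts are connected. We normalize $A_{\rm part}$ by
$$\tilde{A}_{\rm part} = E_{\rm part} \circ \big(D_{\rm part}^{-1} A_{\rm part}\big),$$
where $D_{\rm part}$ is the diagonal degree matrix of $A_{\rm part}$; $E_{\rm part}$ is a trainable weight matrix and $\circ$ is the element-wise multiplication.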
Similarly to the JGC operator (5), we propose the part-scale graph convolution (PGC) for spatial feature learning. Let the part features at time $t$ be $X_{\rm part}^{(t)}$ and the output features be $X_{\rm part,out}^{(t)}$; the PGC works as
$$X_{\rm part,out}^{(t)} = {\rm PGC}(X_{\rm part}^{(t)}) = \tilde{A}_{\rm part}\, X_{\rm part}^{(t)}\, W_{\rm part}, \qquad \text{(7)}$$
where $W_{\rm part}$ denotes the trainable parameters and $\tilde{A}_{\rm part}$ is the normalized part-scale adjacency matrix. With (7), we propagate information between body-parts on the part-scale graph, leading to abstract spatial patterns. Notably, we do not need a part-scale actional graph, because the part-scale graph already includes some integrated relations internally, and long-range links span a shorter distance on it.
4.2.2 Part-Scale Graph and Temporal Convolution Block
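Considering the temporal evolution, we use the same temporal convolution as in the JGTC block to form the part-scale graph and temporal convolution block (PGTC block). Let the input part feature tensor be $\mathcal{X}_{\rm part}$; we have
$$\mathcal{X}_{\rm part,s}[t,:,:] = {\rm ReLU}\big({\rm PGC}(\mathcal{X}_{\rm part}[t,:,:])\big), \qquad \text{(8a)}$$
$$\mathcal{X}_{\rm part,out} = {\rm ReLU}\big({\rm TC}(\mathcal{X}_{\rm part,s};\ \tau, s)\big), \qquad \text{(8b)}$$
where $t$ denotes the time stamp, $\tau$ is the temporal kernel size and $s$ is the temporal convolution stride. Compared to the JGTC block, the PGTC block extracts the spatial and temporal features of actions at a higher scale.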
4.3 Difference Operator
Intuitively, the states of motion, such as velocity and acceleration, carry important dynamics information and make it easier to extract spatial-temporal features. To achieve this, we employ a difference operator to preprocess the input sequences. The idea is to compute high-order differences of the pose sequences, guiding the model to learn motion information more easily. The zero-order difference is $\Delta^{0} X^{(t)} = X^{(t)}$, where $X^{(t)}$ is the pose at time $t$, and the $\beta$-order difference ($\beta > 0$) of the pose is
$$\Delta^{\beta} X^{(t)} = \Delta^{\beta-1} X^{(t)} - \Delta^{\beta-1} X^{(t-1)}, \qquad \text{(9)}$$
where $\Delta^{\beta}$ denotes the $\beta$th-order difference operator. We use zero padding to handle boundary conditions. We feed the first three orders ($\beta = 0, 1, 2$) to our model, reflecting positions, velocities, and accelerations. In the model, the three differences can be efficiently computed in parallel.
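The difference operator can be written in a few lines of NumPy. How the boundary is padded is not fully specified in the text; this sketch assumes that a zero frame is prepended to each difference so that all orders keep the same length, and the name `differences` is hypothetical.

```python
import numpy as np

def differences(X, max_order=2):
    """X: (T, M, 3) pose sequence. Returns orders 0..max_order, each of shape (T, M, 3)."""
    out = [X]
    for _ in range(max_order):
        prev = out[-1]
        diff = prev[1:] - prev[:-1]                         # frame-to-frame difference
        # assumed boundary handling: pad the first frame with zeros
        out.append(np.concatenate([np.zeros_like(prev[:1]), diff], axis=0))
    return out

# One joint coordinate following t^2: positions 0, 1, 4, 9, 16
X = (np.arange(5.0) ** 2).reshape(5, 1, 1)
pos, vel, acc = differences(X, max_order=2)
print(vel.ravel())   # proxy of velocity
print(acc.ravel())   # proxy of acceleration (constant away from the boundary)
```

For a quadratic trajectory the second-order difference becomes constant after the boundary frames, matching the interpretation of the three orders as positions, velocities, and accelerations.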
5 Symbiotic Graph Neural Networks
To construct the multi-tasking model, we need a deep backbone for high-level action pattern extraction as well as two task-specific modules. In this section, we present the architecture of our Symbiotic Graph Neural Networks (SymGNN). First, we present the deep backbone network, which uses multi-scale graphs for feature learning; we then present the action-recognition head and the motion-prediction head with an effective multi-tasking scheme. Finally, we present a dual network, which learns features from body bones, instead of body-joints, and provides complementary information for downstream tasks.
5.1 Backbone: Multi-Branch Multi-Scale Graph Convolution Networks
To learn the high-level action pattern, the proposed SymGNN uses a deep backbone called multi-branch multi-scale graph convolution network (multi-branch multi-scale GCN). It employs parallel multi-scale GCN branches to process high-order action differences for rich dynamics learning and also considers multi-scale graphs for spatial feature extraction. Fig. 6 shows the backbone, where the left plot is the backbone framework including three branches of multi-scale GCNs; the right plot is the structure of each branch of multi-scale GCN. In what follows, we describe the backbone architecture in detail.
5.1.1 Multiple Branches
The backbone has three branches of multi-scale GCN. Each branch uses a distinct order of action differences as input, treating the motion states for dynamics learning (see Fig. 6). Calculating the differences by difference operators, we first obtain the proxies of ‘positions’, ‘velocities’ and ‘accelerations’ as the input of the network; see (9). The three branches have identical network architectures. We obtain semantics of high-order differences and concatenate them together for action recognition and motion prediction.
5.1.2 Multi-Scale GCN
To learn the detailed and general action features comprehensively, each branch of the backbone is a multi-scale GCN based on two scales of graphs: joint-scale and part-scale graphs. For each scale, we use the corresponding operators, i.e. JGTC blocks (see Section 4.1.4) and PGTC blocks (see Section 4.2.2), to extract spatial and temporal features.
Concretely, in the joint scale, the body is modeled by joint-scale graphs, where both the actional graph and structural graphs are used to capture body-joint correlations. This joint scale uses a cascade of JGTC blocks based on the learned actional-structural graphs. In the part scale, we use part-scale graphs whose nodes are body-parts to represent high-level body instances, and we stack multiple PGTC blocks for feature capturing. To be aware of the multi-scale intermediate representations and learn rich and consistent patterns, we introduce a fusion mechanism between the hidden layers of the two scales, called bidirectional fusion.
Bidirectional Fusion. The bidirectional fusion exchanges features from both the joint scale and the part scale; see illustrations in Fig. 5 and Fig. 6. It contains two operations:

Joint2part pooling. For the joint scale, we use pooling to average the joint features on the same part to represent a super node. Then, we concatenate the pooling result to the corresponding part feature in the part scale. As shown in Fig. 5, we average torso joints to obtain a node in the part-scale graph and concatenate it to the original part-scale features.

Part2joint matching. For the part scale, the part features are copied several times to match the number of corresponding joints, and we concatenate the copied part features to the joints. As shown in Fig. 5, we copy the thigh twice and concatenate the copies to the hip and knee in the joint scale.
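The two fusion operations can be sketched with NumPy as follows. The sketch assumes every joint is assigned to exactly one part (if a joint were listed in several parts, the last assignment would overwrite earlier ones), and the function names and the toy joint-to-part assignment are hypothetical.

```python
import numpy as np

def joint2part(X_joint, parts):
    """Average the features of the joints in each part. X_joint: (M, D) -> (P, D)."""
    return np.stack([X_joint[idx].mean(axis=0) for idx in parts])

def part2joint(X_part, parts, num_joints):
    """Copy each part feature to all of its joints. X_part: (P, D) -> (M, D)."""
    out = np.zeros((num_joints, X_part.shape[1]))
    for p, idx in enumerate(parts):
        out[idx] = X_part[p]
    return out

# Toy body: 5 joints grouped into 2 parts
parts = [[0, 1], [2, 3, 4]]
X_joint = np.arange(10.0).reshape(5, 2)
X_part = np.zeros((2, 2))

pooled = joint2part(X_joint, parts)                       # (2, 2)
copied = part2joint(pooled, parts, num_joints=5)          # (5, 2)
fused_part = np.concatenate([X_part, pooled], axis=1)     # part scale receives joint info
fused_joint = np.concatenate([X_joint, copied], axis=1)   # joint scale receives part info
print(fused_part.shape, fused_joint.shape)
```

After fusion the feature dimension of each scale grows by the dimension of the features received from the other scale, so the following blocks must accept the concatenated width.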
Fig. 6 (right plot) shows the internal operations and one bidirectional fusion in the multi-scale GCN. Given the joint-scale input attributes, we first use a JGTC block to extract the initial joint-scale features, and a joint2part pooling is applied on the joint-scale features to compute the initial part-scale features. We next feed them into two parallel JGTC and PGTC blocks. Then we concatenate the responses mapped by joint2part pooling and part2joint matching to the features in the opposite scales. Therefore, both scales adapt well to multi-scale information. After multiple interactive JGTC and PGTC blocks in the multi-scale GCN, we fuse the outputs of the two scales through summation, followed by average pooling to remove the temporal dimension, and obtain the high-level features.
Finally, we concatenate the outputs from three branches together and use them as the comprehensive semantics for action recognition and motion prediction.
5.2 Multi-tasking I: Action Recognition
For action recognition, SymGNN can be represented as $\hat{y} = \mathcal{F}_{\rm recog}(X_{\rm prev};\ \theta_{\rm b}, \theta_{\rm r})$, where $\mathcal{F}_{\rm recog}(\cdot)$ is the recognition sub-model of the entire SymGNN, $\mathcal{F}(\cdot)$; that is, the recognition performance depends on the backbone network and a recognition module constructed following the backbone. Given the high-level features extracted by the three branches of the backbone, $H_0$, $H_1$ and $H_2$, we concatenate them and employ an MLP to produce the fused feature:
$$H_{\rm recog} = f_{\rm recog}\big([H_0, H_1, H_2]\big),$$
where $f_{\rm recog}(\cdot)$ denotes the fusing network of the recognition task and $[\cdot,\cdot,\cdot]$ is the concatenation operator of three matrices along the feature dimension. To integrate the joint dynamics, we apply global average pooling over the joints of $H_{\rm recog}$ and obtain a feature vector that represents the whole body. To generate the recognition results, we finally feed the vector into a 1-layer network with a softmax classifier, obtaining $\hat{y}$.
5.3 Multi-tasking II: Motion Prediction
For motion prediction, our SymGNN works as $\hat{X}_{\rm pred} = \mathcal{F}_{\rm pred}(X_{\rm prev};\ \theta_{\rm b}, \theta_{\rm p})$, using the backbone and a prediction module, where $\mathcal{F}_{\rm pred}(\cdot)$ is the prediction sub-model of SymGNN. Therefore, we additionally build a motion-prediction head, whose functionality is to sequentially predict the future poses. Fig. 7 shows the overall structure. We adopt a self-regressive mechanism and an identity connection in the motion-prediction head, which utilizes a gated recurrent unit (GRU) to model the temporal evolution.
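Concretely, an MLP is first employed to embed the features of the three action differences, $H_0$, $H_1$ and $H_2$:
$$H_{\rm pred} = f_{\rm pred}\big([H_0, H_1, H_2]\big),$$
where $f_{\rm pred}(\cdot)$ denotes the fusing network of the prediction task. Let $H^{(0)} = H_{\rm pred}$ be the initial states of the GRU-based predictor and $\hat{X}^{(0)} = X^{(0)}$ be the pose at the current time stamp. To produce the $(t+1)$th pose $\hat{X}^{(t+1)}$, the motion-prediction head works as
$$\tilde{H}^{(t)} = {\rm JGC}\big(H^{(t)}\big), \qquad \text{(10a)}$$
$$H^{(t+1)} = {\rm GRU}\big([\hat{X}^{(t)}, \hat{y}],\ \tilde{H}^{(t)}\big), \qquad \text{(10b)}$$
$$\hat{X}^{(t+1)} = \hat{X}^{(t)} + f_{\rm out}\big(H^{(t+1)}\big), \qquad \text{(10c)}$$
where ${\rm JGC}(\cdot)$, ${\rm GRU}(\cdot)$ and $f_{\rm out}(\cdot)$ represent the JGC operator (5), the GRU cell and the output MLP, respectively. The subsequent $\hat{X}^{(t)}$s are the predictions obtained sequentially and fed back recurrently. In Step (10a), we apply the JGC to update the hidden states; in Step (10b), we feed the updated hidden states, current pose and classified labels into the GRU cell to produce the features that reflect future displacement; in Step (10c), we add the predicted displacement to the previous pose to predict the next frame.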
The motion-prediction head has three advantages: (i) we use JGC to update hidden features, capturing more complicated motion patterns; (ii) we input multiple orders of pose differences and classified labels to the GRU, providing explicit motion priors; and (iii) connected by the residual, the GRU and MLP predict the displacement for each frame; this makes predictions precise and robust.
5.4 Multi-Objective Optimization
To train action recognition and motion prediction simultaneously, we consider a multi-objective scheme.
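The self-regressive, residual structure of the head can be sketched as a simple rollout loop. Here `step_fn` and `out_fn` are hypothetical stand-ins for the hidden-state update (JGC followed by the GRU cell) and the output MLP; the sketch only illustrates how each prediction is fed back and how displacements are accumulated, not the actual network.

```python
import numpy as np

def rollout(x0, h0, step_fn, out_fn, num_steps):
    """Predict num_steps future poses from the current pose x0 and hidden state h0."""
    poses, x, h = [], x0, h0
    for _ in range(num_steps):
        h = step_fn(x, h)       # stand-in for the JGC + GRU hidden-state update
        x = x + out_fn(h)       # residual: predicted displacement added to the last pose
        poses.append(x)         # the new pose becomes the next input
    return np.stack(poses)

# Toy stand-ins: the hidden state counts steps, the output is a constant displacement
x0 = np.zeros((2, 3))                                   # 2 joints, 3D positions
poses = rollout(x0, np.zeros(1),
                step_fn=lambda x, h: h + 1.0,
                out_fn=lambda h: 0.1 * np.ones((2, 3)),
                num_steps=3)
print(poses[:, 0, 0])                                   # accumulated displacement per step
```

Because the network output is a displacement rather than an absolute pose, a zero output reproduces the last pose, which is one common motivation for this residual design.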
To recognize actions, we minimize the cross entropy between the ground-truth categorical labels and the inferred ones. Let the true label of the $n$th sample be $y_n$ and the corresponding classification result be $\hat{y}_n$. For $N$ training samples in one mini-batch, the action recognition loss is formulated as
$$\mathcal{L}_{\rm recog} = -\frac{1}{N} \sum_{n=1}^{N} y_n^{\top} \log \hat{y}_n, \qquad \text{(11)}$$
where $^{\top}$ denotes the transpose operation.
For motion prediction, we minimize the distance between the target motions and the predicted clips. Let the th target and predictions be and , for samples in one minibatch, the prediction loss is
(12) 
where denotes the norm. According to our experiments, the norm leads to more precise predictions compared to the common norm.
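The two loss terms can be sketched in plain Python. The bodies of (11) and (12) are not reproduced in this text, so the sketch assumes the standard mini-batch cross entropy for one-hot labels and a mean p-norm distance; the function names and the default `p=1` are assumptions, not the paper's stated choice.

```python
import math
from typing import Sequence


def recognition_loss(labels: Sequence[Sequence[float]],
                     probs: Sequence[Sequence[float]]) -> float:
    """Mean cross entropy between one-hot labels and predicted probabilities."""
    total = 0.0
    for y, y_hat in zip(labels, probs):
        # Inner product of the label with the log-probabilities (y^T log y_hat).
        total -= sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)
    return total / len(labels)


def prediction_loss(targets: Sequence[Sequence[float]],
                    predictions: Sequence[Sequence[float]],
                    p: int = 1) -> float:
    """Mean p-norm distance between flattened target and predicted motions."""
    total = 0.0
    for x, x_hat in zip(targets, predictions):
        total += sum(abs(a - b) ** p for a, b in zip(x, x_hat)) ** (1.0 / p)
    return total / len(targets)


ce = recognition_loss([[1.0, 0.0]], [[0.5, 0.5]])
l1 = prediction_loss([[0.0, 0.0]], [[3.0, 4.0]], p=1)
l2 = prediction_loss([[0.0, 0.0]], [[3.0, 4.0]], p=2)
```

Keeping `p` as a parameter makes it easy to compare norms on the same predictions, in the spirit of the comparison mentioned above.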
To integrate the two losses for training, we propose a convex combination, i.e., a weighted sum of (11) and (12); that is
where the combination coefficient trades off the importance of the two tasks. To balance action recognition and motion prediction, instead of fixing the coefficient by hand, we employ the ‘multiple-gradient descent algorithm’ (MGDA) to obtain a proper coefficient for the multitasking loss terms. Following [51], the coefficient is adaptively adjusted during training by matching one of the Karush-Kuhn-Tucker (KKT) conditions of the optimization problem. The optimized coefficient thus allocates weights to the two tasks adaptively, leading to better performance on both tasks. In our training scheme, all the model parameters are trained end-to-end with the stochastic gradient descent algorithm [52]; see the Appendix for more details.
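For two tasks, the MGDA coefficient has a well-known closed form: it is the point on the segment between the two task gradients with minimum norm. The sketch below illustrates that closed form under stated assumptions: gradients are flattened into plain lists, the function names are hypothetical, and attaching the coefficient to the recognition loss is an assumption, since the text does not show which term carries it.

```python
from typing import Sequence


def min_norm_coefficient(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Return alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||^2."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        # Identical gradients: every alpha gives the same norm.
        return 0.5
    # Unconstrained minimizer ((g2 - g1) . g2) / ||g1 - g2||^2, then clipped.
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / diff_sq
    return min(1.0, max(0.0, alpha))


def combined_loss(alpha: float, loss_recognition: float,
                  loss_prediction: float) -> float:
    """Convex combination of the two task losses."""
    return alpha * loss_recognition + (1.0 - alpha) * loss_prediction


# Orthogonal gradients of equal length are weighted equally.
alpha = min_norm_coefficient([1.0, 0.0], [0.0, 1.0])
total = combined_loss(alpha, 2.0, 4.0)
```

When one gradient is a scaled-up copy of the other, the clipping puts all the weight on the shorter gradient, so the task with the larger gradient does not dominate the shared parameters.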
5.5 Bonebased Dual Graph Neural Networks
While joint features describe actions from the joint perspective, the attributes of bones, such as lengths and orientations, provide crucial complementary information. In this section, we construct a bone-based graph that is dual to the original joint-scale graph: its vertices are bones and its edges link bones.
To represent the feature of each bone, we subtract the coordinates of its two endpoint joints, which encodes both bone length and orientation. The subtraction order is from the centrifugal joint to the centripetal joint, and the bone attribute is the resulting difference of the two joint locations along time. Then, we construct the bone-based dual actional and structural graphs to model the bone relations, and we also build the part-scale dual graph. The dual actional graph is learned from bone features by AGIM (see section 4.1.1). For the dual structural graph, a 1-hop edge links two bones that share an articulated joint, and the high-hop edges are extended from the 1-hop edges. The part-scale attributes are obtained by integrating bone attributes, and the part-scale graph is built according to the nature of the body. The bone-based graphs are dual to the joint-scale graphs and are employed to extract complementary bone features.
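The bone attributes and the 1-hop edges of the dual structural graph can be illustrated on a toy skeleton. This is a sketch under assumptions: each bone is stored as a hypothetical `(inner, outer)` pair of joint indices, the attribute is taken as outer minus inner (the exact sign convention is not recoverable from the text), and the four-joint skeleton is invented for illustration.

```python
from typing import List, Sequence, Set, Tuple

Joint = Tuple[float, float, float]
Bone = Tuple[int, int]  # (inner joint index, outer joint index)


def bone_attributes(joints: Sequence[Joint],
                    bones: Sequence[Bone]) -> List[Tuple[float, ...]]:
    """Bone vector per bone: coordinates of the outer joint minus the inner one."""
    return [tuple(o - i for o, i in zip(joints[outer], joints[inner]))
            for inner, outer in bones]


def dual_one_hop_edges(bones: Sequence[Bone]) -> Set[Tuple[int, int]]:
    """Two bones are 1-hop neighbours in the dual graph if they share a joint."""
    edges = set()
    for a in range(len(bones)):
        for b in range(a + 1, len(bones)):
            if set(bones[a]) & set(bones[b]):
                edges.add((a, b))
    return edges


# Toy skeleton for one frame: joint 1 connects to joints 0, 2, and 3.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
bones = [(0, 1), (1, 2), (1, 3)]
attributes = bone_attributes(joints, bones)
edges = dual_one_hop_edges(bones)
```

For a sequence, the same subtraction is applied frame by frame, so the bone attribute has the same temporal length as the joint input.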
Given the bone-based graphs, we train a bone-based graph neural network. We input bone attributes; everything else follows the joint-based network. Finally, we fuse the joint-based and bone-based recognition outputs by a weighted summation before the softmax function to calculate the classification results, which tend to be more accurate and, in turn, improve motion prediction.
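The late fusion of the two recognition outputs can be sketched as a weighted sum of class scores followed by a softmax. The fusion weight of 0.5 and the function names are assumptions for illustration; the text does not state the weight that was used.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(joint_scores: Sequence[float],
                bone_scores: Sequence[float],
                weight: float = 0.5) -> List[float]:
    """Weighted sum of pre-softmax scores, then one softmax over the result."""
    fused = [weight * j + (1.0 - weight) * b
             for j, b in zip(joint_scores, bone_scores)]
    return softmax(fused)


# The joint network prefers class 0, the bone network strongly prefers class 1.
probs = fuse_scores([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
predicted_class = probs.index(max(probs))
```

Fusing before the softmax lets a confident network outvote an uncertain one, because the raw score margins are preserved in the sum.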
6 Experiments and Analysis
In this section, we evaluate the proposed SymGNN. First, we introduce the datasets and model settings in detail; then, we compare the performance of SymGNN with that of other state-of-the-art methods; finally, we present the ablation studies of our model.
6.1 Datasets and Model Setting
6.1.1 Dataset
We conduct extensive experiments on four large-scale datasets: NTURGB+D [32], Kinetics [14], Human 3.6M [33] and CMU Mocap. The details are as follows.
NTURGB+D: NTURGB+D, containing skeleton action sequences completed by one or two performers and categorized into classes, is one of the largest datasets for 3D skeletonbased action recognition. It provides the 3D spatial coordinates of joints for each subject in an action. For method evaluation, two protocols are recommended: ‘CrossSubject’ (CS) and ‘CrossView’ (CV). In CS, samples performed by subjects are separated into the training set, and the rest belong to the test set. CV assigns data according to camera views, where training and test set have and samples, respectively.
Kinetics: Kinetics is a large dataset for human action analysis, containing over video clips. There are classes of actions. Since the dataset provides only RGB videos, we obtain skeleton data by estimating joint locations on pixels with the OpenPose toolbox [53]. The toolbox generates 2D pixel coordinates and a confidence score for every joint, so we represent each joint as a three-element feature vector. For the multiple-person cases, we select the body with the highest average joint confidence in each sequence. Therefore, one clip with frames is transformed into a skeleton sequence with the dimension of .
Human 3.6M: Human 3.6M (H3.6M) is a large motion capture dataset that is increasingly popular. Seven subjects perform classes of actions, where each subject has joints. We downsample all sequences by two. The models are trained on six subjects and tested on the specific clips of the th subject. Notably, the dataset provides the joint locations in angle space, and we transform them into exponential maps and only use the joints with nonzero values (actually joints).
CMU Mocap: CMU Mocap includes five major action categories, and each subject in CMU Mocap has joints, which are represented by angle positions. We use the same strategy presented in [31] to select the actions, and thus choose eight actions: ‘Basketball’, ‘Basketball Signal’, ‘Directing Traffic’, ‘Jumping’, ‘Running’, ‘Soccer’, ‘Walking’ and ‘Washing Window’. We preprocess the data and compute the corresponding exponential maps with the same approach as for the Human 3.6M dataset.
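The body selection used for multi-person Kinetics clips can be sketched as follows. The data layout is an assumption for illustration: each body is a list of frames, and each frame is a list of `(x, y, confidence)` triples as produced by a 2D pose estimator; `select_body` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

JointFeature = Tuple[float, float, float]  # (x pixel, y pixel, confidence)
Body = List[List[JointFeature]]            # frames -> joints


def select_body(bodies: Sequence[Body]) -> Body:
    """Keep the body whose joints have the highest mean confidence over the clip."""
    def mean_confidence(body: Body) -> float:
        scores = [c for frame in body for (_, _, c) in frame]
        return sum(scores) / len(scores)
    return max(bodies, key=mean_confidence)


# Two detected people in a one-frame clip with two joints each.
body_a = [[(10.0, 20.0, 0.9), (11.0, 21.0, 0.7)]]
body_b = [[(50.0, 60.0, 0.4), (51.0, 61.0, 0.5)]]
kept = select_body([body_b, body_a])
```

Averaging over the whole sequence, rather than per frame, keeps the same person throughout the clip.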
6.1.2 Model Setting and Implementation Details
The models are implemented with PyTorch 0.4.1. Since different datasets have distinctive patterns and complexities, we employ specific configurations of SymGNN on the corresponding datasets for feature learning.
For NTURGB+D and Kinetics, the backbone network of SymGNN contains JGTC blocks and PGTC blocks. In each three JGTC and PGTC blocks, the feature dimensions are respectively , and . The kernel size of TC is and it shrinks the temporal dimension with stride after the rd and th blocks, where we use bidirectional fusion mechanisms. . The actionrecognition head is a layer MLP, whose hidden dimension is . For the motionprediction head, the hidden dimensions of GRU and output MLP are . For the actional graph inference module (AGIM), we use layer D MLPs with ReLU, batch normalization and dropout in each iteration. We use SGD algorithm to train SymGNN, where the learning rate is initially and decays by every epochs. The model is trained with batch size for epochs on GTX1080Ti GPUs. For both NTURGB+D and Kinetics, the last 10 frames are used for motion prediction and other previous frames are fed into SymGNN for action recognition.
As for Human 3.6M and CMU Mocap, due to the simpler dynamics and fewer categories, we propose a light version of SymGNN, which extracts meaningful features with more shallow networks, improving efficiency for motion prediction. In the backbone, we use JGTC blocks and PGTC blocks, whose feature dimensions are , , and ; the temporal convolution strides in blocks are: , , , , respectively. We apply bidirectional fusions at the last 3 layers. . The recognition and motionprediction heads, as well as AGIM, leverage the same architecture as we set for NTURGB+D. We train the model using Adam optimizer with the learning rate and batch size for iterations on one GTX1080Ti GPU. All the hyperparameters are selected using a validation set.
6.2 Comparison with StateoftheArts
On the three large-scale skeleton-formed datasets, we compare the proposed SymGNN with state-of-the-art methods for human action recognition and motion prediction.
6.2.1 3D Skeletonbased Action Recognition
For action recognition, we first show the classification accuracies of SymGNN and the baselines on the two recommended benchmarks of NTURGB+D, i.e. CrossSubject and CrossView [32]. The state-of-the-art models are based on manifold analysis [10], recurrent neural networks [12, 32, 28], convolution networks [24, 13, 9], and graph networks [14, 54, 28, 35, 29, 36, 37, 1, 38]. Moreover, to investigate different components of SymGNN, such as multiple graphs and multitasking, we test several model variants: SymGNN using only joint-scale structural graphs (Only JS), only joint-scale actional graphs (Only JA), only the part-scale graph (Only P), no bone-based dual graphs (No bone), no prediction for multitasking (No pred), and the complete model. Table I presents the recognition accuracies of all methods.
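Table I: Action recognition accuracies on NTURGB+D under the CrossSubject (CS) and CrossView (CV) benchmarks.
Methods  CS  CV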

Lie Group [10]  50.1%  52.8% 
HRNN [12]  59.1%  64.0% 
Deep LSTM [32]  60.7%  67.3% 
PALSTM [32]  62.9%  70.3% 
STLSTM+TS [34]  69.2%  77.7% 
Temporal Conv [24]  74.3%  83.1% 
Visualize CNN [13]  76.0%  82.6% 
CCNN+MTLN  79.6%  84.8% 
STGCN [14]  81.5%  88.3% 
DPRL [54]  83.5%  89.8% 
SRTSL [28]  84.8%  92.4% 
HCN [9]  86.5%  91.1% 
STGRGCN [37]  86.9%  92.3% 
motifGCN [38]  84.2%  90.2% 
ASGCN [1]  86.8%  94.2% 
2sAGCN [35]  88.5%  95.1% 
AGCLSTM [29]  89.2%  95.0% 
DGNN [36]  89.9%  96.1% 
SymGNN (Only JS)  88.3%  94.5% 
SymGNN (Only JA)  85.7%  93.7% 
SymGNN (Only P)  86.5%  87.3% 
SymGNN (No bone)  87.1%  93.8% 
SymGNN (No pred)  89.0%  95.7% 
SymGNN  90.1%  96.4% 
Table II: Top1 and top5 action recognition accuracies on Kinetics.
Methods  Top1  Top5

Feature Encoding [11]  14.9%  25.8% 
Deep LSTM [32]  16.4%  35.3% 
Temporal Conv [24]  20.3%  40.0% 
STGCN [14]  30.7%  52.8% 
STGRGCN [37]  33.6%  56.1% 
ASGCN [1]  34.8%  56.3% 
2sAGCN [35]  36.1%  58.7% 
DGNN [36]  36.9%  56.9% 
SymGNN (No pred)  36.4%  57.4% 
SymGNN  37.2%  58.1% 
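Table III: Top1 and top5 action recognition accuracies on Human 3.6M and CMU Mocap.
Methods  Human 3.6M Top1  Human 3.6M Top5  CMU Mocap Top1  CMU Mocap Top5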
STGCN [14]  40.2%  78.4%  87.5%  96.9% 
HCN [9]  47.6%  88.8%  95.4%  99.2% 
2sAGCN [35]  55.4%  94.1%  97.1%  99.8% 
SymGNN (Only J)  55.6%  93.9%  96.5%  99.4% 
SymGNN (Only P)  54.3%  93.1%  94.9%  98.0% 
SymGNN (No bone)  53.5%  93.2%  93.5%  95.8% 
SymGNN (No pred)  55.2%  94.1%  96.6%  99.4% 
SymGNN  56.5%  95.3%  98.8%  100% 
We see that the complete SymGNN, which utilizes joint-scale and part-scale graphs, the motion-prediction head, and the dual bone-based network, outperforms the baselines on both benchmarks. The results reveal that richer joint relations help capture more useful patterns, and that additional motion prediction and complementary bone-based features improve discrimination.
Table IV: PCK@0.05 (%) of short-term motion prediction on NTURGB+D for each of the 10 future frames.
Benchmarks  Methods  1  2  3  4  5  6  7  8  9  10  Average

CrossSubject  ZeroV [17]  70.32  52.77  32.64  23.19  18.05  14.11  11.92  9.84  8.35  6.04  24.72 
Ressup [17]  76.25  55.96  40.31  29.47  22.81  16.96  13.65  11.57  10.13  8.87  28.60  
CSM [31]  82.38  68.11  56.84  45.26  38.65  30.41  26.15  22.74  17.52  15.96  40.40  
SkelTNet [30]  93.62  86.44  81.03  75.85  70.81  66.57  59.60  54.45  46.92  40.18  67.55  
SymGNN (No recg)  98.74  97.07  94.95  93.94  93.48  91.79  90.69  89.27  87.87  86.01  92.34  
SymGNN  99.00  97.65  95.89  95.10  93.44  92.70  91.75  90.65  89.54  89.10  93.48  
CrossView  ZeroV [17]  75.69  55.72  39.88  29.60  21.91  15.23  12.06  10.18  8.70  7.33  27.63 
Ressup [17]  78.85  59.91  43.82  32.37  24.32  18.51  14.86  12.29  10.38  8.85  30.42  
CSM [31]  85.41  71.75  58.20  46.69  39.07  31.85  28.43  24.17  19.66  18.93  42.62  
SkelTNet [30]  94.81  89.12  83.85  79.00  72.74  69.11  62.39  66.97  48.88  42.70  70.66  
SymGNN (No recg)  98.99  96.79  95.55  94.68  93.03  92.13  90.88  89.69  88.70  87.57  92.79  
SymGNN  99.25  97.87  96.38  95.21  94.06  92.95  91.94  91.01  90.18  89.27  93.81 
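Table V: Mean angle errors (MAE) of short-term motion prediction for four representative actions in Human 3.6M.
Motion  Walking  Eating  Smoking  Discussion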

milliseconds  80  160  320  400  80  160  320  400  80  160  320  400  80  160  320  400 
ZeroV [17]  0.39  0.68  0.99  1.15  0.27  0.48  0.73  0.86  0.26  0.48  0.97  0.95  0.31  0.67  0.94  1.04 
ERD [42]  0.93  1.18  1.59  1.78  1.27  1.45  1.66  1.80  1.66  1.95  2.35  2.42  2.27  2.47  2.68  2.76 
Lstm3LR [42]  0.77  1.00  1.29  1.47  0.89  1.09  1.35  1.46  1.34  1.65  2.04  2.16  1.88  2.12  2.25  2.23 
SRNN [16]  0.81  0.94  1.16  1.30  0.97  1.14  1.35  1.46  1.45  1.68  1.94  2.08  1.22  1.49  1.83  1.93 
DropAE [55]  1.00  1.11  1.39  /  1.31  1.49  1.86  /  0.92  1.03  1.15  /  1.11  1.20  1.38  / 
Samploss [17]  0.92  0.98  1.02  1.20  0.98  0.99  1.18  1.31  1.38  1.39  1.56  1.65  1.78  1.80  1.83  1.90 
Ressup [17]  0.27  0.46  0.67  0.75  0.23  0.37  0.59  0.73  0.32  0.59  1.01  1.10  0.30  0.67  0.98  1.06 
CSM [31]  0.33  0.54  0.68  0.73  0.22  0.36  0.58  0.71  0.26  0.49  0.96  0.92  0.32  0.67  0.94  1.01 
TPRNN [56]  0.25  0.41  0.58  0.65  0.20  0.33  0.53  0.67  0.26  0.47  0.88  0.90  0.30  0.66  0.96  1.04 
QuaterNet [25]  0.21  0.34  0.56  0.62  0.20  0.35  0.58  0.70  0.25  0.47  0.93  0.90  0.26  0.60  0.85  0.93 
AGED [18]  0.21  0.35  0.55  0.64  0.18  0.28  0.50  0.63  0.27  0.43  0.81  0.83  0.26  0.56  0.77  0.84 
BiHMPGAN [27]  0.33  0.52  0.63  0.67  0.20  0.33  0.54  0.70  0.26  0.50  0.91  0.86  0.33  0.65  0.91  0.95 
SkelTNet [30]  0.31  0.50  0.69  0.76  0.20  0.31  0.53  0.69  0.25  0.50  0.93  0.89  0.30  0.64  0.89  0.98 
VGRUr1 [57]  0.34  0.47  0.64  0.72  0.27  0.40  0.64  0.79  0.36  0.61  0.85  0.92  0.46  0.82  0.95  1.21 
SymGNN (Only JA)  0.19  0.35  0.54  0.63  0.18  0.34  0.54  0.66  0.23  0.43  0.84  0.82  0.26  0.62  0.81  0.87 
SymGNN (Only JS)  0.19  0.33  0.54  0.69  0.17  0.32  0.52  0.66  0.21  0.41  0.83  0.82  0.24  0.64  0.93  1.01 
SymGNN (No recg)  0.18  0.31  0.50  0.59  0.16  0.29  0.49  0.61  0.21  0.40  0.80  0.80  0.22  0.57  0.85  0.93 
SymGNN  0.17  0.31  0.50  0.60  0.16  0.29  0.48  0.60  0.21  0.40  0.76  0.80  0.21  0.55  0.77  0.85 
Then, we evaluate our model for action recognition on Kinetics and compare it with six previous models: a hand-crafted method, Feature Encoding [11]; two deep models, Deep LSTM [32] and Temporal ConvNet [24]; and three graph-based methods, STGCN [14], 2sAGCN [35], and DGNN [36]. Table II shows the top1 and top5 classification results, where ‘No pred’ denotes the SymGNN variant without the motion-prediction head and the multitasking framework. We see that SymGNN outperforms the other methods on top1 recognition accuracy and achieves competitive results on top5 recognition accuracy.
Additionally, we evaluate our model for action recognition on Human 3.6M and CMU Mocap. Table III presents the top1 and top5 classification accuracies on both datasets. Here we compare SymGNN with a few recently proposed methods: STGCN [14], HCN [9], and 2sAGCN [35]. The complete SymGNN achieves the highest accuracies on both datasets, showing the effectiveness of our model.
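Table VI: MAEs of short-term motion prediction for the remaining actions in Human 3.6M, together with the average MAE.
Motion  Directions  Greeting  Phoning  Posing  Purchases  Sitting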

millisecond  80  160  320  400  80  160  320  400  80  160  320  400  80  160  320  400  80  160  320  400  80  160  320  400 
ZeroV [17]  0.39  0.59  0.79  0.89  0.54  0.89  1.30  1.49  0.64  1.21  1.65  1.83  0.28  0.57  1.13  1.37  0.62  0.88  1.19  1.27  0.40  1.63  1.02  1.18 
Ressup [17]  0.41  0.64  0.80  0.92  0.57  0.83  1.45  1.60  0.59  1.06  1.45  1.60  0.45  0.85  1.34  1.56  0.58  0.79  1.08  1.15  0.41  0.68  1.12  1.33 
CSM [31]  0.39  0.60  0.80  0.91  0.51  0.82  1.21  1.38  0.59  1.13  1.51  1.65  0.29  0.60  1.12  1.37  0.63  0.91  1.19  1.29  0.39  0.61  1.02  1.18 
TPRNN [56]  0.38  0.59  0.75  0.83  0.51  0.86  1.27  1.44  0.57  1.08  1.44  1.59  0.42  0.76  1.29  1.54  0.59  0.82  1.12  1.18  0.41  0.66  1.07  1.22 
AGED [18]  0.23  0.39  0.62  0.69  0.54  0.80  1.29  1.45  0.52  0.96  1.22  1.43  0.30  0.58  1.12  1.33  0.46  0.78  1.00  1.07  0.41  0.75  1.04  1.19 
SkelTNet [30]  0.36  0.58  0.77  0.86  0.50  0.84  1.28  1.45  0.58  1.12  1.52  1.64  0.29  0.62  1.19  1.44  0.58  0.84  1.17  1.24  0.40  0.61  1.01  1.15 
SymGNN (No recg)  0.24  0.45  0.61  0.67  0.36  0.61  0.98  1.17  0.50  0.86  1.29  1.43  0.18  0.44  0.99  1.22  0.40  0.62  1.00  1.08  0.23  0.41  0.80  0.97 
SymGNN  0.23  0.42  0.57  0.65  0.35  0.60  0.95  1.15  0.48  0.80  1.28  1.41  0.18  0.45  0.97  1.20  0.40  0.60  0.97  1.04  0.24  0.41  0.77  0.95 
Motion  Sitting Down  Taking Photo  Waiting  Walking Dog  Walking Together  Average  
millisecond  80  160  320  400  80  160  320  400  80  160  320  400  80  160  320  400  80  160  320  400  80  160  320  400 
ZeroV [17]  0.39  0.74  1.07  1.19  0.25  0.51  0.79  0.92  0.34  0.67  1.22  1.47  0.60  0.98  1.36  1.50  0.33  0.66  0.94  0.99  0.39  0.77  1.05  1.21 
Ressup. [17]  0.47  0.88  1.37  1.54  0.28  0.57  0.90  1.02  0.32  0.63  1.07  1.26  0.52  0.89  1.25  1.40  0.27  0.53  0.74  0.79  0.40  0.69  1.04  1.18 
CSM [31]  0.41  0.78  1.16  1.31  0.23  0.49  0.88  1.06  0.30  0.62  1.09  1.30  0.59  1.00  1.32  1.44  0.27  0.52  0.71  0.74  0.38  0.68  1.01  1.13 
TPRNN [56]  0.41  0.79  1.13  1.27  0.26  0.51  0.80  0.95  0.30  0.60  1.09  1.28  0.53  0.93  1.24  1.38  0.23  0.47  0.67  0.71  0.37  0.66  0.99  1.11 
AGED [18]  0.33  0.61  0.97  1.08  0.23  0.48  0.81  0.95  0.25  0.50  1.02  1.12  0.50  0.82  1.15  1.27  0.23  0.42  0.56  0.63  0.33  0.58  0.94  1.01 
SkelTNet [30]  0.37  0.72  1.05  1.17  0.24  0.47  0.78  0.93  0.30  0.63  1.17  1.40  0.54  0.88  1.20  1.35  0.27  0.53  0.68  0.74  0.36  0.64  0.99  1.02 
SymGNN (No recg)  0.30  0.62  0.91  1.03  0.16  0.34  0.55  0.66  0.22  0.49  0.89  1.09  0.42  0.74  1.09  1.25  0.17  0.34  0.52  0.57  0.26  0.50  0.82  0.94 
SymGNN  0.28  0.60  0.89  0.99  0.14  0.32  0.53  0.64  0.22  0.48  0.87  1.06  0.42  0.73  1.08  1.22  0.16  0.33  0.50  0.56  0.26  0.49  0.79  0.92 
Table VII: MAEs of motion prediction from 80 ms to 1000 ms for eight actions in CMU Mocap.
Motion  Basketball  Basketball Signal  Directing Traffic  Jumping

milliseconds  80  160  320  400  1000  80  160  320  400  1000  80  160  320  400  1000  80  160  320  400  1000 
Ressup [17]  0.49  0.77  1.26  1.45  1.77  0.42  0.76  1.33  1.54  2.17  0.31  0.58  0.94  1.10  2.06  0.57  0.86  1.76  2.03  2.42 
Resuns [17]  0.53  0.82  1.30  1.47  1.81  0.44  0.80  1.35  1.55  2.17  0.35  0.62  0.95  1.14  2.08  0.59  0.90  1.82  2.05  2.46 
CSM [31]  0.37  0.62  1.07  1.18  1.95  0.32  0.59  1.04  1.24  1.96  0.25  0.56  0.89  1.00  2.04  0.39  0.60  1.36  1.56  2.01 
BiHMPGAN [27]  0.37  0.62  1.02  1.11  1.83  0.32  0.56  1.01  1.18  1.89  0.25  0.51  0.85  0.96  1.95  0.39  0.57  1.32  1.51  1.94 
SkelTNet [30]  0.35  0.63  1.04  1.14  1.78  0.24  0.40  0.69  0.80  1.07  0.22  0.44  0.78  0.90  1.88  0.35  0.53  1.28  1.49  1.85 
SymGNN (No recg)  0.33  0.48  0.95  1.09  1.47  0.15  0.26  0.47  0.56  1.04  0.20  0.41  0.77  0.89  1.95  0.32  0.55  1.40  1.60  1.87 
SymGNN  0.32  0.48  0.91  1.06  1.47  0.12  0.21  0.38  0.49  0.94  0.20  0.41  0.75  0.87  1.84  0.32  0.55  1.40  1.60  1.82 
Motion  Running  Soccer  Walking  Washing Window  
milliseconds  80  160  320  400  1000  80  160  320  400  1000  80  160  320  400  1000  80  160  320  400  1000 
Ressup [17]  0.32  0.48  0.65  0.74  1.00  0.29  0.50  0.87  0.98  1.73  0.35  0.45  0.59  0.64  0.88  0.32  0.47  0.74  0.93  1.37 
Resuns [17]  0.35  0.50  0.69  0.76  1.04  0.31  0.51  0.90  1.00  1.77  0.36  0.47  0.62  0.65  0.93  0.33  0.47  0.75  0.95  1.40 
CSM [31]  0.28  0.41  0.52  0.57  0.67  0.26  0.44  0.75  0.87  1.56  0.35  0.44  0.45  0.50  0.78  0.30  0.47  0.80  1.01  1.39 
BiHMPGAN [27]  0.28  0.40  0.50  0.53  0.62  0.26  0.44  0.72  0.82  1.51  0.35  0.45  0.44  0.46  0.72  0.31  0.46  0.77  0.92  1.31 
SkelTNet [30]  0.38  0.48  0.57  0.62  0.71  0.24  0.41  0.69  0.79  1.44  0.33  0.41  0.45  0.48  0.73  0.31  0.46  0.79  0.96  1.37 
SymGNN (No recg)  0.21  0.33  0.53  0.56  0.66  0.22  0.38  0.72  0.83  1.38  0.26  0.32  0.38  0.41  0.54  0.22  0.33  0.62  0.83  1.07 
SymGNN  0.21  0.33  0.53  0.56  0.65  0.19  0.32  0.66  0.78  1.32  0.26  0.32  0.35  0.39  0.52  0.22  0.33  0.55  0.73  1.05 
Notably, for Human 3.6M, there is a relatively large gap between the top1 and top5 accuracies, because the input motions are fragmentary clips of long sequences with incomplete semantics, and the activities differ only subtly (e.g. ‘Eating’ and ‘Smoking’ are similar). In other words, SymGNN learns the common features and provides reasonable discrimination, resulting in a high top5 accuracy, but it is confused by nonsemantic variances, which lowers the top1 accuracy. In contrast, CMU Mocap has more distinctive actions, on which we obtain high classification accuracies.
6.2.2 3D Skeletonbased Motion Prediction
To validate the model for predicting future motions, we train SymGNN on NTURGB+D, Human 3.6M, and CMU Mocap. There are two specific tasks: short-term and long-term motion prediction. Concretely, short-term prediction commonly targets poses within 400 milliseconds, while long-term prediction aims to predict poses at 1000 ms or longer. To reveal the effectiveness of SymGNN, we compare against many state-of-the-art methods, which learn dynamics from pose vectors [42, 17, 56, 18, 27] or from separate body-parts [16, 31, 30]. We also include a naive baseline, named ZeroV [17], which sets all predictions to be the last observed frame.
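The ZeroV baseline is simple enough to state in a few lines. The sketch below follows the description above (every predicted frame equals the last observed frame); the function name and the list-of-lists pose layout are assumptions.

```python
from typing import List, Sequence


def zero_velocity(observed: Sequence[Sequence[float]],
                  num_future: int) -> List[List[float]]:
    """Repeat the last observed pose for every future frame."""
    last = observed[-1]
    return [list(last) for _ in range(num_future)]


observed = [[0.0, 1.0], [2.0, 3.0]]
prediction = zero_velocity(observed, num_future=3)
```

Despite its simplicity, this baseline is competitive at very short horizons, which is why it is a useful sanity check for learned predictors.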
Short-term motion prediction: We validate SymGNN on two datasets: NTURGB+D and Human 3.6M. For NTURGB+D, we train SymGNN to generate the future 10 frames for each input sequence. We compare SymGNN with several previous methods and with the SymGNN variant that abandons the auxiliary action-recognition head (No recg). As the metric, we use the percentage of correct points within a normalized region of 0.05 (PCK@0.05); that is, a joint is counted as correctly predicted if the normalized distance between the predicted location and the ground truth is less than 0.05. The PCK@0.05 of different models is presented in Table IV. We see that 1) our model outperforms the baselines by a large margin, especially for the longer term; and 2) using action recognition and motion prediction together obtains the highest PCK@0.05 along time, demonstrating that the recognition task enhances dynamics learning.
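The PCK@0.05 metric can be sketched as below. The normalization is an assumption: the text only says the distance is normalized, so the sketch divides by a caller-supplied `scale` (for example a body or scene size), and the function name `pck` is hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def pck(predicted: Sequence[Point], target: Sequence[Point],
        scale: float, threshold: float = 0.05) -> float:
    """Percentage of joints whose normalized error is below the threshold."""
    correct = sum(1 for p, t in zip(predicted, target)
                  if math.dist(p, t) / scale < threshold)
    return 100.0 * correct / len(target)


# One joint is off by 0.01 (counted as correct), the other by 0.2 (incorrect).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.01, 0.0, 0.0), (1.2, 0.0, 0.0)]
score = pck(predicted, target, scale=1.0)
```

In practice the score would be averaged over all joints and sequences for each future frame, which gives one column of Table IV.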
Then, we compare SymGNN to baselines for short-term prediction on Human 3.6M, where the models generate poses up to 400 ms into the future. We analyze several variants of SymGNN with different components, including using only joint-scale actional graphs (Only JA) or joint-scale structural graphs (Only JS), as well as no recognition task (No recg). As another metric, we compute the mean angle errors (MAE) between the predictions and the ground truths, representing the errors from predicted poses to targets in angle space. We first test four representative actions: ‘Walking’, ‘Eating’, ‘Smoking’ and ‘Discussion’. Table V shows the MAEs of different methods that predict motions up to 400 ms. As we see, when SymGNN simultaneously employs multiple graphs and multitasking, our method outperforms all the baselines and its own ablations.
We also test SymGNN on the remaining actions in Human 3.6M, where the MAEs of some recent methods are shown in Table VI. SymGNN also achieves the best performance on most actions and the lowest average MAE on motions. Although the mentioned top1 classification accuracy on this dataset is not very high (see Table III), we note that the estimated soft labels cover the common motion factors, resulting in high top5 recognition accuracy. For example, people walk in ‘Walking’, ‘Walking Dog’ and ‘Walking Together’, and we need the walking factors instead of the specific labels for motion generation. Given the soft labels, the model tends to obtain precise predictions.
Long-term motion prediction: For long-term prediction, SymGNN is tested on Human 3.6M and CMU Mocap. We predict the future poses up to 1000 milliseconds, which is challenging due to action variation and nonlinearity [17]. Table VIII presents the MAEs of various models for predicting the 4 motions in Human 3.6M at the future 560 ms and 1000 ms.
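Table VIII: MAEs of long-term motion prediction at 560 ms and 1000 ms for four actions in Human 3.6M.
Motion  Walking  Eating  Smoking  Discussion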

milliseconds  560  1k  560  1k  560  1k  560  1k 
ZeroV [17]  1.35  1.32  1.04  1.38  1.02  1.69  1.41  1.96 
ERD [42]  2.00  2.38  2.36  2.41  3.68  3.82  3.47  2.92 
Lstm3LR [42]  1.81  2.20  2.49  2.82  3.24  3.42  2.48  2.93 
SRNN [16]  1.90  2.13  2.28  2.58  3.21  3.23  2.39  2.43 
DropAE [55]  1.55  1.39  1.76  2.01  1.38  1.77  1.53  1.73 
Ressup. [17]  0.93  1.03  0.95  1.08  1.25  1.50  1.43  1.69 
CSM [31]  0.86  0.92  0.89  1.24  0.97  1.62  1.44  1.86 
TPRNN [56]  0.74  0.77  0.84  1.14  0.98  1.66  1.39  1.74 
AGED [18]  0.78  0.91  0.86  0.93  1.06  1.21  1.25  1.30 
BiHMPGAN [27]  /  0.85  /  1.20  /  1.11  /  1.77 
SkelTNet [30]  0.79  0.83  0.84  1.06  0.98  1.21  1.19  1.75 
SymGNN  0.75  0.78  0.77  0.88  0.92  1.18  1.17  1.28 
We see that SymGNN outperforms the competitors on ‘Eating’, ‘Smoking’ and ‘Discussion’, and obtains competitive results on ‘Walking’.
To further evaluate SymGNN, we conduct long-term prediction on eight classes of actions in CMU Mocap. We present the MAEs of SymGNN with and without the action-recognition head. Table VII shows the prediction MAEs from 80 ms to 1000 ms into the future. Note that we train the model for long-term prediction, so the ‘short-term’ MAEs are intermediate results obtained while predicting up to 1000 ms. We see that SymGNN significantly outperforms the state-of-the-art methods on the actions ‘Basketball’, ‘Basketball Signal’ and ‘Washing Window’, and obtains competitive performance on ‘Jumping’ and ‘Running’.
Effectiveness-efficiency tradeoff: We also compare the prediction errors and efficiency of various models, because a high response speed and precise generation are both essential for real-time motion prediction. Notably, the AGIM propagates the features between joints and edges iteratively. The number of iterations trades off effectiveness against speed; i.e., more iterations lead to a lower MAE but a slower speed. To represent the running speed, we use the number of frames generated in each frame period during prediction. We tune the number of iterations and compare SymGNN to other methods on Human 3.6M, showing the running speeds and MAEs for prediction at 400 ms.
Fig. 8 shows the effectiveness-efficiency tradeoff, where one axis is the number of frames generated per frame period (reflecting prediction speed) and the other axis is the MAE. The red circles denote different numbers of iterations in AGIM, ordered from the rightmost circle to the leftmost one. We see that the proposed SymGNN is both faster and more precise than its competitors.
6.3 Ablation Studies
6.3.1 Symbiosis of Recognition and Prediction
To analyze the mutual effects of action recognition and motion prediction, we conduct several experiments.
Table IX: Recognition accuracies on NTURGB+D with various ratios of noisy prediction targets.
Noise ratio  CS  CV

No pred 
Table X: Average MAEs of short-term prediction on Human 3.6M with noisy action labels.
Motion  Average

Noise ratio  80 ms  160 ms  320 ms  400 ms 
No recg 
We first study the effects of motion prediction on action recognition. We use accurate class labels but noisy future poses to train the multitasking SymGNN for action recognition. To represent noisy supervision, we randomly shuffle a percentage of the target motions among the training data. Table IX presents the recognition accuracies with various ratios of noisy prediction targets on the two benchmarks of NTURGB+D. We also show the recognition results of the model without the motion-prediction head. We see that 1) the motion-prediction head benefits the action-recognition head: introducing a motion-prediction head is beneficial even when the noise ratio is around ; 2) when the noise ratio exceeds , the recognition performance tends to be slightly worse than that of the model without the motion-prediction head, reflecting that action recognition is robust against deflected motion prediction. Consequently, we show that motion prediction strengthens action recognition.
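The noisy supervision described above can be simulated by permuting a fraction of the targets among the training samples. This is a sketch under assumptions: the exact sampling procedure is not given in the text, and `shuffle_targets`, its `seed` argument, and the integer stand-ins for target motions are all hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_targets(targets: Sequence[T], noise_ratio: float,
                    seed: int = 0) -> List[T]:
    """Reassign a `noise_ratio` fraction of targets among randomly chosen samples."""
    rng = random.Random(seed)
    count = int(round(noise_ratio * len(targets)))
    chosen = rng.sample(range(len(targets)), count)  # samples to corrupt
    permuted = chosen[:]
    rng.shuffle(permuted)                            # new owner of each target
    noisy = list(targets)
    for source, destination in zip(chosen, permuted):
        noisy[destination] = targets[source]
    return noisy


clean = list(range(10))  # integers stand in for target motions
noisy = shuffle_targets(clean, noise_ratio=0.5)
changed = sum(a != b for a, b in zip(clean, noisy))
```

Because the corruption is a permutation, the set of targets is unchanged and only the pairing between inputs and targets is broken; the same helper applies to class labels for the categorical-noise experiment.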
On the other hand, we test how confused recognition results affect motion prediction by using noisy action categories. Following Table IX, we shuffle training labels to represent categorical noise. Table X presents the average MAEs for shortterm prediction with noisy action labels on Human 3.6M. We demonstrate that an accurate actionrecognition head helps effective motion prediction.
Finally, we test how prediction promotes recognition when the observed data is limited, where we keep only an early portion of each motion, specified by an observation ratio, for action recognition. There are three models with various prediction strategies: 1) predicting the future 10 frames (‘Pred 10 frames’); 2) predicting all future frames (‘Pred all frames’); 3) no prediction (‘No pred’). Fig. 9 illustrates the recognition accuracies of the three models at different observation ratios. As we see, when the observation ratio is low, ‘Pred all frames’ can be aware of the entire action sequence and capture richer dynamics, showing the best performance; when the observation ratio is high, predicting 10 frames and predicting all frames perform similarly because the inputs carry sufficient patterns, but both outperform ‘No pred’ because they preserve information. By introducing the motion-prediction head, SymGNN has the potential for action classification in the early period.
6.3.2 Effects of Graphs
In this section, we study the abilities of various graphs, namely, only jointscale structural graphs (Only JS), only jointscale actional graph (Only JA), only partscale graph (Only P), and combining them (full).
For action recognition, we train SymGNN on NTURGB+D, CrossSubject, and investigate different graph configurations. When the joint-scale structural graph is involved, we set the number of hops in the joint-scale structural graphs (JSHop) to 1, 2, 3, and 4 in turn. Note that when we use only the joint-scale structural graph with a single hop, the corresponding graph is exactly the skeleton itself. Table XI presents the results of SymGNN with different graph components for action recognition.
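The role of the hop number can be illustrated by computing which joint pairs lie within a given number of hops on the skeleton. The sketch uses breadth-first search on a toy four-joint chain; the function names and the skeleton are assumptions, and the paper's structural graphs may weight or normalize the hops differently.

```python
from collections import deque
from typing import List, Sequence, Set, Tuple


def hop_distances(num_joints: int,
                  bones: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """All-pairs hop counts on the skeleton via breadth-first search (-1 if unreachable)."""
    neighbours: List[List[int]] = [[] for _ in range(num_joints)]
    for a, b in bones:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dist = [[-1] * num_joints for _ in range(num_joints)]
    for source in range(num_joints):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if dist[source][nxt] < 0:
                    dist[source][nxt] = dist[source][node] + 1
                    queue.append(nxt)
    return dist


def structural_edges(num_joints: int, bones: Sequence[Tuple[int, int]],
                     max_hop: int) -> Set[Tuple[int, int]]:
    """Undirected joint pairs that are between 1 and `max_hop` hops apart."""
    dist = hop_distances(num_joints, bones)
    return {(i, j) for i in range(num_joints) for j in range(i + 1, num_joints)
            if 1 <= dist[i][j] <= max_hop}


chain = [(0, 1), (1, 2), (2, 3)]  # toy skeleton: a four-joint chain
one_hop = structural_edges(4, chain, max_hop=1)
two_hop = structural_edges(4, chain, max_hop=2)
```

With `max_hop=1` the edge set is exactly the bone list, matching the remark that the single-hop structural graph is the skeleton itself.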
Table XI: Recognition accuracies on NTURGB+D, CrossSubject, with different graph components.
JSHop ()  Only JS  Only JA  Only P  full

1  
2  
3  
4 
We see that 1) by representing long-range structural relations, a higher JSHop leads to more effective action recognition; 2) combining the multiple graphs introduced from different perspectives improves the action recognition performance significantly.
For motion prediction, we study the graph components using a setting similar to that of Table XI. We validate SymGNN on Human 3.6M; the average short-term prediction MAEs are presented in Table XII.
Table XII: Average MAEs of short-term prediction on Human 3.6M with different graph components.
JSHop ()  Only JS  Only JA  Only P  full

1  0.622  0.618  0.615  0.616 
2  0.619  0.615  
3  0.613  0.611  
4  0.618  0.614 
The results demonstrate that multiple relations promote motion prediction. We note that too large a hop number introduces redundant and confusing relations, enlarging the prediction error.
Table XIII: Recognition accuracies on NTURGB+D with different difference orders.
Difference Order  CS  CV

Table XIV: Average MAEs of short-term prediction on Human 3.6M with different difference orders.
Motion  Average

Milliseconds  80  160  320  400 
0.33  0.59  0.85  0.91  
0.28  0.52  0.81  0.85  
0.26  0.49  0.79  0.82 
6.3.3 Balance JointScale Actional and Structural Graphs
In our model, the power of the joint-scale actional and structural graphs in the JGC operator is traded off by a hyperparameter (see (5)). Here we analyze how this hyperparameter affects the model performance.
For action recognition, we test our model on NTURGB+D, CrossSubject, and present the classification accuracies with different values of the hyperparameter; for motion prediction, we show the average MAEs for short-term prediction. Fig. 10 illustrates the model performance on both tasks.
We see: 1) when , we obtain the highest recognition accuracies, showing large improvements than cases with other ; 2) for motion prediction, the performance is robust against different , where the MAEs fluctuate around , but and lead to the lowest errors.
6.3.4 Highorder Difference
Here we study the high-order differences of the input actions for action recognition and motion prediction, which help to capture richer motion dynamics. In our model, the difference orders are considered to be 0, 1, and 2, reflecting positions, velocities, and accelerations. We present the results of action recognition on NTURGB+D and the results of motion prediction on Human 3.6M.
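The high-order differences can be computed by repeatedly subtracting consecutive frames. This is a minimal sketch: the function name is hypothetical, and the handling of sequence length (each order shortens the sequence by one frame here, with no padding) is an assumption, since the text does not describe it.

```python
from typing import List, Sequence


def temporal_difference(sequence: Sequence[Sequence[float]],
                        order: int) -> List[List[float]]:
    """Apply the frame-to-frame difference `order` times to a pose sequence."""
    current = [list(frame) for frame in sequence]
    for _ in range(order):
        current = [[b - a for a, b in zip(previous, following)]
                   for previous, following in zip(current, current[1:])]
    return current


# One coordinate following t^2: constant acceleration of 2 per frame squared.
positions = [[0.0], [1.0], [4.0], [9.0]]
velocities = temporal_difference(positions, order=1)
accelerations = temporal_difference(positions, order=2)
```

Order 0 returns the positions unchanged, so the three inputs fed to the network correspond to orders 0, 1, and 2 of the same sequence.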
For action recognition, we test SymGNN with three difference configurations on the two benchmarks (CS & CV) of the NTURGB+D dataset. Table XIII presents the average recognition accuracies of SymGNN for the three configurations. We see that the configuration using positions and velocities leads to the highest classification accuracies on both benchmarks, indicating that positions and velocities capture comprehensive movement patterns and provide rich semantics, while the model complexity stays moderate.
For motion prediction, we feed the model with various action differences of Human 3.6M to generate future poses within 400 ms. We obtain the average prediction MAEs at the timestamps of 80, 160, 320 and 400 ms. The results are presented in Table XIV. We see that SymGNN forecasts the future poses with the lowest prediction errors when we feed it the combination of high-order action differences, showing the effectiveness of combining high-order motion states.
Parallel Network | CS | CV
Only Joint | |
Only Bone | |
Joint & Bone | |
6.3.5 Bone-based Dual Graph Neural Networks
We validate the effectiveness of using dual networks that take joint features and bone features, respectively, as inputs for action recognition. Table XV presents the recognition accuracies for different combinations of joint-based and bone-based dual networks on the two benchmarks of the NTU-RGB+D dataset. We see that using only joint features or only bone features does not yield the most accurate recognition, whereas combining joint and bone features improves classification performance by a large margin, indicating that the two networks carry complementary information.
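To make the joint/bone distinction concrete, here is a small sketch that derives bone vectors from joint positions and combines the class scores of two streams. The parent list, the convention that a bone is a joint minus its parent, and the score-summing fusion in `fuse_scores` are assumptions made for illustration; this section does not state how the two networks are actually fused.

```python
def bones_from_joints(joints, parents):
    """Return one bone vector per joint: joint position minus parent position.

    joints: list of (x, y, z) tuples. parents[i] is the parent index of
    joint i; the root points at itself and gets a zero-length bone.
    """
    return [
        tuple(c - p for c, p in zip(joints[i], joints[parents[i]]))
        for i in range(len(joints))
    ]


def fuse_scores(joint_scores, bone_scores):
    """Sum per-class scores of the two streams and return the best class index.

    Summation is a hypothetical fusion rule chosen for this sketch.
    """
    fused = [j + b for j, b in zip(joint_scores, bone_scores)]
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical 3-joint chain: root -> joint 1 -> joint 2.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
parents = [0, 0, 1]
bones = bones_from_joints(joints, parents)
predicted_class = fuse_scores([0.2, 0.5, 0.3], [0.6, 0.1, 0.3])
```

In the toy example the joint stream alone would pick class 1, while the fused scores pick class 0, which shows how a second stream can change the decision.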
6.4 Visualization
In this section, we visualize some representations of SymGNN, including the learned joint-scale actional graphs and their low-dimensional manifolds. Moreover, we show some predicted motions to evaluate the model qualitatively.
6.4.1 Joint-Scale Actional Graphs
We first show the learned joint-scale actional graphs on four motions in Human 3.6M. Fig. 11 illustrates the edges that have the top-15 largest weights in each graph, indicating the 15 strongest action-based relations associated with different motions.
We see: 1) the joint-scale actional graphs capture some action-based long-range relations beyond direct bone connections; 2) some reasonable relations are captured, e.g., for ‘Directions’, the stretched arms are correlated with other joints; 3) motions of the same category tend to yield similar graphs (see the two plots of ‘Walking’), while different classes of motions have distinct actional graphs (see ‘Walking’ and the other motions), showing that the model learns discriminative patterns from data.
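Selecting the strongest relations from a learned graph can be sketched as below. The helper `top_k_edges` is hypothetical; it treats the graph as a directed weighted adjacency matrix and ignores self-loops, both of which are assumptions of this sketch rather than details stated here.

```python
def top_k_edges(adjacency, k):
    """Return the k off-diagonal edges with the largest weights.

    adjacency: square list of lists, adjacency[i][j] is the weight of the
    edge from joint i to joint j. Each result is (weight, source, target).
    """
    edges = [
        (weight, i, j)
        for i, row in enumerate(adjacency)
        for j, weight in enumerate(row)
        if i != j
    ]
    return sorted(edges, reverse=True)[:k]


# Hypothetical 3-joint actional graph.
adjacency = [
    [0.0, 0.9, 0.1],
    [0.2, 0.0, 0.7],
    [0.4, 0.3, 0.0],
]
strongest = top_k_edges(adjacency, 2)
```

With a real graph, calling the same function with `k=15` would give the edge set that a figure like Fig. 11 draws.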
6.4.2 Manifolds of Joint-Scale Actional Graphs
To verify how discriminative the patterns embedded in the joint-scale actional graphs are, we visualize the low-dimensional manifolds of different joint-scale actional graphs. We select 8 representative classes of actions in Human 3.6M and sample more clips from long test motion sequences. Here we treat all the joint-scale actional graphs as vectors and obtain their 2D t-SNE map; see Fig. 12. We see that ‘Walking’, ‘Walking Dog’ and ‘Walking Together’, which share common walking dynamics, are distributed closely, and ‘Sitting’ and ‘Sitting Down’ are likewise clustered; however, walking-related and sitting-related actions are separated by a large margin. ‘Eating’, ‘Smoking’ and ‘Taking Photo’ have similar arm movements and form another cluster.
6.4.3 Predicted Sequences
Finally, we compare the generated samples of SymGNN to those of Ressup [17], AGED [18], and CSM [31] on Human 3.6M and CMU Mocap. Fig. 13 illustrates the future poses of ‘Eating’ and ‘Running’ within 1000 ms at a frame interval of 80 ms, where plot (a) shows the predictions of ‘Eating’ in Human 3.6M and plot (b) visualizes ‘Running’ in CMU Mocap. Compared to the baselines, we see that SymGNN provides significantly better predictions. The poses generated by Ressup have large errors after 600 ms (two orange boxes); AGED produces excessive movements of the downward hand in the long term (red box in plot (a)); CSM gives twisted poses in the long term (red box in plot (b)). In contrast, SymGNN completes the action accurately and reasonably.
6.5 Stability Analysis: Robustness against Input Perturbation
According to Theorem 4.1.3, SymGNN is robust against perturbations of its inputs, and we calculate an upper bound on the output deviation. To verify the stability, we add Gaussian noises sampled from on input actions. We show the recognition accuracies on NTU-RGB+D (Cross-Subject) and short-term prediction MAEs on Human 3.6M with standard deviation varied from to . The recognition/prediction performances with different are illustrated in Fig. 14. We see: 1) for action recognition, SymGNN maintains a high accuracy when the noise has , but it tends to deteriorate due to severe perturbation when ; 2) for motion prediction, SymGNN produces precise poses when the noise has , but the prediction performance is degraded for larger . Overall, SymGNN is robust against small perturbations.
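The perturbation protocol can be illustrated with a short sketch that adds Gaussian noise of increasing standard deviation to an input sequence and measures how much a model output changes. Everything here is a stand-in: `mean_over_time` is a hypothetical toy model, the zero-mean noise and the listed standard deviations are assumptions, and the real experiments report accuracy and MAE rather than this raw output deviation.

```python
import random


def add_gaussian_noise(sequence, sigma, rng):
    """Add independent zero-mean Gaussian noise with std sigma to every value."""
    return [[value + rng.gauss(0.0, sigma) for value in frame] for frame in sequence]


def mean_over_time(sequence):
    """Hypothetical stand-in for a model: per-coordinate mean over all frames."""
    count = len(sequence)
    return [sum(column) / count for column in zip(*sequence)]


def output_deviation(model, sequence, sigmas, seed=0):
    """Mean absolute change of the model output for each noise level."""
    rng = random.Random(seed)
    clean = model(sequence)
    deviations = {}
    for sigma in sigmas:
        noisy = model(add_gaussian_noise(sequence, sigma, rng))
        diffs = [abs(a - b) for a, b in zip(clean, noisy)]
        deviations[sigma] = sum(diffs) / len(diffs)
    return deviations


# Toy sequence of 20 frames with 2 coordinates each.
sequence = [[0.1 * t, 1.0 - 0.05 * t] for t in range(20)]
deviations = output_deviation(
    mean_over_time, sequence, sigmas=[0.0, 0.01, 0.1, 1.0]
)
```

Plotting `deviations` against the noise level gives a curve analogous in spirit to Fig. 14, with the task metric replaced by output change.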
Given two inputs and , which satisfy , we validate the assumption of claimed in Theorem 4.1.3. We calculate the ratio between the perturbations of responses and inputs; that is,
Similar to Fig. 14, we tune the standard deviations of input noises, obtain the corresponding , and calculate the perturbation ratios. Fig. 15 illustrates the with different noise standard deviations. We see that the is steady at around for adjusted from to , which indicates the amplification factor and that the responses and inputs are mostly linearly correlated. In other words, the actional graph inference module does not amplify the perturbation and the stability of JGC is still preserved.
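A ratio between response perturbation and input perturbation can be computed as in the sketch below. The Euclidean norm, the function `perturbation_ratio`, and the toy module `halve` are assumptions for illustration, since the exact norm used in the paper's definition is not visible in this section.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def perturbation_ratio(module, x1, x2):
    """Ratio of output change to input change for two nearby inputs.

    Uses the Euclidean norm, which is an assumption of this sketch.
    """
    y1, y2 = module(x1), module(x2)
    output_change = l2_norm([a - b for a, b in zip(y1, y2)])
    input_change = l2_norm([a - b for a, b in zip(x1, x2)])
    return output_change / input_change


def halve(x):
    """Hypothetical module that scales its input by 0.5."""
    return [0.5 * v for v in x]


x1 = [1.0, 2.0, 3.0]
x2 = [1.1, 1.9, 3.2]
ratio = perturbation_ratio(halve, x1, x2)
```

For a linear module such as `halve`, the ratio is the same for any pair of inputs; a roughly constant ratio across noise levels is what the text above interprets as a near-linear relation between responses and inputs.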
7 Conclusions
In this paper, we propose a novel symbiotic graph neural network (SymGNN), which handles action recognition and motion prediction jointly and uses graph-based operations to capture action patterns. Our model consists of a backbone, an action-recognition head, and a motion-prediction head, where the two heads enhance each other. As building components in the backbone and the motion-prediction head, graph convolution operators based on learnable joint-scale and part-scale graphs are used to extract spatial information. We conduct extensive experiments for action recognition and motion prediction with four datasets: NTU-RGB+D, Kinetics, Human 3.6M, and CMU Mocap. Experiments show that our model achieves consistent improvements compared to previous methods.
References
 [1] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, “Actional-structural graph convolutional networks for skeleton-based action recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [2] U. Gaur, Y. Zhu, B. Song, and A. Roy-Chowdhury, “A ‘string of feature graphs’ model for recognition of complex activities in natural videos,” in The IEEE International Conference on Computer Vision (ICCV), November 2011, pp. 2595–2602.
 [3] D. Huang and K. Kitani, “Action-reaction: Forecasting the dynamics of human interaction,” in The European Conference on Computer Vision (ECCV), July 2014, pp. 489–504.
 [4] L. Gui, K. Zhang, Y. Wang, X. Liang, J. Moura, and M. Veloso, “Teaching robots to predict human motion,” in IEEE International Conference on Intelligent Robots and Systems (IROS), October 2018.
 [5] S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1408–1424, July 2015.
 [6] Y. Shi, B. Fernando, and R. Hartley, “Action anticipation with rbf kernelized feature mapping rnn,” in The European Conference on Computer Vision (ECCV), September 2018, pp. 301–317.
 [7] D. Wang, W. Ouyang, W. Li, and D. Xu, “Dividing and aggregating network for multi-view action recognition,” in The European Conference on Computer Vision (ECCV), September 2018, pp. 451–476.
 [8] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint body parsing pose estimation network and a new benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 4, pp. 871–885, April 2019.
 [9] C. Li, Q. Zhong, D. Xie, and S. Pu, “Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation,” in IJCAI, July 2018, pp. 786–792.
 [10] R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014, pp. 588–595.
 [11] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, “Modeling video evolution for action recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5378–5387.
 [12] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1110–1118.
 [13] M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization for view invariant human action recognition,” in Pattern Recognition, vol. 68, August 2017, pp. 346–362.
 [14] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in AAAI Conference on Artificial Intelligence, February 2018, pp. 7444–7452.
 [15] A. Lehrmann, P. Gehler, and S. Nowozin, “Efficient nonlinear markov models for human motion,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014, pp. 1314–1321.
 [16] A. Jain, A. Zamir, S. Savarese, and A. Saxena, “Structural-RNN: Deep learning on spatio-temporal graphs,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 5308–5317.
 [17] J. Martinez, M. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 4674–4683.
 [18] L. Gui, Y. Wang, X. Liang, and J. Moura, “Adversarial geometry-aware human motion prediction,” in The European Conference on Computer Vision (ECCV), September 2018, pp. 786–803.
 [19] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2012, pp. 1290–1297.
 [20] M. Hussein, M. Torki, M. Gowayyed, and M. El-Saban, “Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations,” in IJCAI, August 2013, pp. 2466–2472.
 [21] J. Wang, A. Hertzmann, and D. Fleet, “Gaussian process dynamical models,” in Advances in Neural Information Processing Systems (NeurIPS), December 2006, pp. 1441–1448.
 [22] G. Taylor, G. Hinton, and S. Roweis, “Modeling human motion using binary latent variables,” in Advances in Neural Information Processing Systems (NeurIPS), December 2007.
 [23] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent temporal restricted boltzmann machine,” in Advances in Neural Information Processing Systems (NeurIPS), 2009, pp. 1601–1608.
 [24] T. Kim and A. Reiter, “Interpretable 3d human action analysis with temporal convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1623–1631.
 [25] D. Pavllo, D. Grangier, and M. Auli, “Quaterne