# Video Prediction Benchmarks
We provide benchmark results of spatiotemporal predictive learning (STL) methods on various video prediction datasets. More STL methods will be supported in the future; issues and PRs are welcome! Currently, we only provide benchmark results; trained models and logs will be released soon (contact us if you require these files). Model files can be downloaded from Baidu Cloud (code: tgr6).
## Table of Contents
Currently supported spatiotemporal prediction methods

- [x] ConvLSTM (NeurIPS'2015)
- [x] PredNet (ICLR'2017)
- [x] PredRNN (NeurIPS'2017)
- [x] PredRNN++ (ICML'2018)
- [x] E3D-LSTM (ICLR'2019)
- [x] MIM (CVPR'2019)
- [x] CrevNet (ICLR'2020)
- [x] PhyDNet (CVPR'2020)
- [x] MAU (NeurIPS'2021)
- [x] PredRNN.V2 (TPAMI'2022)
- [x] SimVP (CVPR'2022)
- [x] SimVP.V2 (ArXiv'2022)
- [x] TAU (CVPR'2023)
- [x] DMVFN (CVPR'2023)
Currently supported MetaFormer models for SimVP

- [x] ViT (ICLR'2021)
- [x] Swin-Transformer (ICCV'2021)
- [x] MLP-Mixer (NeurIPS'2021)
- [x] ConvMixer (Openreview'2021)
- [x] UniFormer (ICLR'2022)
- [x] PoolFormer (CVPR'2022)
- [x] ConvNeXt (CVPR'2022)
- [x] VAN (ArXiv'2022)
- [x] IncepU (SimVP.V1) (CVPR'2022)
- [x] gSTA (SimVP.V2) (ArXiv'2022)
- [x] HorNet (NeurIPS'2022)
- [x] MogaNet (ArXiv'2022)
## Moving MNIST Benchmarks
We provide benchmark results on the popular Moving MNIST dataset using the \(10\rightarrow 10\) frames prediction setting following PredRNN. Metrics (MSE, MAE, SSIM, PSNR) of the best models over three trials are reported. Parameters (M), FLOPs (G), and inference FPS on a single V100 GPU are also reported for all methods. All methods are trained with the Adam optimizer and a OneCycle scheduler on a single GPU.
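As a reference for how such frame-level metrics are computed, here is a minimal sketch in NumPy. Reduction conventions (per-pixel mean vs. per-frame sum) and data ranges vary between papers, so the helper below is illustrative rather than the exact benchmark definition; SSIM is typically computed with an off-the-shelf implementation (e.g. scikit-image) and is omitted here.

```python
import numpy as np

def frame_metrics(pred, true, data_range=1.0):
    """Illustrative MSE / MAE / PSNR for a predicted frame sequence.

    pred, true: float arrays of shape (T, H, W) or (T, C, H, W) with
    values in [0, data_range]. Benchmark papers differ on reduction
    (per-pixel mean vs. per-frame sum), so treat these as a sketch.
    """
    err = pred.astype(np.float64) - true.astype(np.float64)
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    # PSNR in dB; infinite for a perfect prediction.
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)
    return {"MSE": mse, "MAE": mae, "PSNR": psnr}

# Toy check: a uniform 0.1 error gives MSE 0.01, MAE 0.1, PSNR 20 dB.
t = np.zeros((10, 64, 64))
m = frame_metrics(t + 0.1, t)
```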
### STL Benchmarks on MMNIST

For a fair comparison of different methods, we report the best results once models are trained to convergence. Config files are provided in `configs/mmnist`.
| Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM-S | 200 epoch | 15.0M | 56.8G | 113 | 29.80 | 90.64 | 0.9288 | 22.10 | |
| ConvLSTM-L | 200 epoch | 33.8M | 127.0G | 50 | 27.78 | 86.14 | 0.9343 | 22.44 | |
| PredNet | 200 epoch | 12.5M | 8.6G | 659 | 161.38 | 201.16 | 0.7783 | 14.33 | |
| PhyDNet | 200 epoch | 3.1M | 15.3G | 182 | 28.19 | 78.64 | 0.9374 | 22.62 | |
| PredRNN | 200 epoch | 23.8M | 116.0G | 54 | 23.97 | 72.82 | 0.9462 | 23.28 | |
| PredRNN++ | 200 epoch | 38.6M | 171.7G | 38 | 22.06 | 69.58 | 0.9509 | 23.65 | |
| MIM | 200 epoch | 38.0M | 179.2G | 37 | 22.55 | 69.97 | 0.9498 | 23.56 | |
| MAU | 200 epoch | 4.5M | 17.8G | 201 | 26.86 | 78.22 | 0.9398 | 22.76 | |
| E3D-LSTM | 200 epoch | 51.0M | 298.9G | 18 | 35.97 | 78.28 | 0.9320 | 21.11 | |
| CrevNet | 200 epoch | 5.0M | 270.7G | 10 | 30.15 | 86.28 | 0.9350 | | |
| PredRNN.V2 | 200 epoch | 23.9M | 116.6G | 52 | 24.13 | 73.73 | 0.9453 | 23.21 | |
| DMVFN | 200 epoch | 3.5M | 0.2G | 1145 | 123.67 | 179.96 | 0.8140 | 16.15 | |
| SimVP+IncepU | 200 epoch | 58.0M | 19.4G | 209 | 32.15 | 89.05 | 0.9268 | 21.84 | |
| SimVP+gSTA-S | 200 epoch | 46.8M | 16.5G | 282 | 26.69 | 77.19 | 0.9402 | 22.78 | |
| TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.60 | 71.93 | 0.9454 | 23.19 | |
| ConvLSTM-S | 2000 epoch | 15.0M | 56.8G | 113 | 22.41 | 73.07 | 0.9480 | 23.54 | |
| PredNet | 2000 epoch | 12.5M | 8.6G | 659 | 31.85 | 90.01 | 0.9273 | 21.85 | |
| PhyDNet | 2000 epoch | 3.1M | 15.3G | 182 | 20.35 | 61.47 | 0.9559 | 24.21 | |
| PredRNN | 2000 epoch | 23.8M | 116.0G | 54 | 26.43 | 77.52 | 0.9411 | 22.90 | |
| PredRNN++ | 2000 epoch | 38.6M | 171.7G | 38 | 14.07 | 48.91 | 0.9698 | 26.37 | |
| MIM | 2000 epoch | 38.0M | 179.2G | 37 | 14.73 | 52.31 | 0.9678 | 25.99 | |
| MAU | 2000 epoch | 4.5M | 17.8G | 201 | 22.25 | 67.96 | 0.9511 | 23.68 | |
| E3D-LSTM | 2000 epoch | 51.0M | 298.9G | 18 | 24.07 | 77.49 | 0.9436 | 23.19 | |
| PredRNN.V2 | 2000 epoch | 23.9M | 116.6G | 52 | 17.26 | 57.22 | 0.9624 | 25.01 | |
| SimVP+IncepU | 2000 epoch | 58.0M | 19.4G | 209 | 21.15 | 64.15 | 0.9536 | 23.99 | |
| SimVP+gSTA-S | 2000 epoch | 46.8M | 16.5G | 282 | 15.05 | 49.80 | 0.9675 | 25.97 | |
| TAU | 2000 epoch | 44.7M | 16.0G | 283 | 15.69 | 51.46 | 0.9661 | 25.71 | |
### Benchmark of MetaFormers Based on SimVP (MetaVP)

Since the hidden translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 200-epoch and 2000-epoch training. Config files are provided in `configs/mmnist/simvp`.
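The swap point can be sketched schematically: the spatial encoder and decoder stay fixed, and any token-mixing plus channel-mixing block can be dropped in as the hidden translator. The components below are hypothetical stand-ins for illustration, not the actual OpenSTL modules.

```python
import numpy as np

def simvp_forward(frames, encoder, translator, decoder):
    """Schematic SimVP pipeline: encode each frame spatially, model the
    latent sequence with a pluggable translator, then decode back to
    frames. Any MetaFormer-style block can play the translator role."""
    z = encoder(frames)      # (T, C, H, W) -> latent sequence
    z = translator(z)        # temporal/token mixing on the latent
    return decoder(z)        # latent -> predicted frames

def identity(x):
    return x

def toy_translator(z):
    # Toy "temporal mixing": average each latent frame with its neighbor.
    return 0.5 * (z + np.roll(z, shift=1, axis=0))

x = np.random.default_rng(0).random((10, 1, 8, 8))
y = simvp_forward(x, identity, toy_translator, identity)  # same shape as x
```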
| MetaVP | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
|---|---|---|---|---|---|---|---|---|---|
| IncepU (SimVPv1) | 200 epoch | 58.0M | 19.4G | 209 | 32.15 | 89.05 | 0.9268 | 21.84 | |
| gSTA (SimVPv2) | 200 epoch | 46.8M | 16.5G | 282 | 26.69 | 77.19 | 0.9402 | 22.78 | |
| ViT | 200 epoch | 46.1M | 16.9G | 290 | 35.15 | 95.87 | 0.9139 | 21.67 | |
| Swin Transformer | 200 epoch | 46.1M | 16.4G | 294 | 29.70 | 84.05 | 0.9331 | 22.22 | |
| Uniformer | 200 epoch | 44.8M | 16.5G | 296 | 30.38 | 85.87 | 0.9308 | 22.13 | |
| MLP-Mixer | 200 epoch | 38.2M | 14.7G | 334 | 29.52 | 83.36 | 0.9338 | 22.22 | |
| ConvMixer | 200 epoch | 3.9M | 5.5G | 658 | 32.09 | 88.93 | 0.9259 | 21.93 | |
| Poolformer | 200 epoch | 37.1M | 14.1G | 341 | 31.79 | 88.48 | 0.9271 | 22.03 | |
| ConvNeXt | 200 epoch | 37.3M | 14.1G | 344 | 26.94 | 77.23 | 0.9397 | 22.74 | |
| VAN | 200 epoch | 44.5M | 16.0G | 288 | 26.10 | 76.11 | 0.9417 | 22.89 | |
| HorNet | 200 epoch | 45.7M | 16.3G | 287 | 29.64 | 83.26 | 0.9331 | 22.26 | |
| MogaNet | 200 epoch | 46.8M | 16.5G | 255 | 25.57 | 75.19 | 0.9429 | 22.99 | |
| TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.60 | 71.93 | 0.9454 | 23.19 | |
| IncepU (SimVPv1) | 2000 epoch | 58.0M | 19.4G | 209 | 21.15 | 64.15 | 0.9536 | 23.99 | |
| gSTA (SimVPv2) | 2000 epoch | 46.8M | 16.5G | 282 | 15.05 | 49.80 | 0.9675 | 25.97 | |
| ViT | 2000 epoch | 46.1M | 16.9G | 290 | 19.74 | 61.65 | 0.9539 | 24.59 | |
| Swin Transformer | 2000 epoch | 46.1M | 16.4G | 294 | 19.11 | 59.84 | 0.9584 | 24.53 | |
| Uniformer | 2000 epoch | 44.8M | 16.5G | 296 | 18.01 | 57.52 | 0.9609 | 24.92 | |
| MLP-Mixer | 2000 epoch | 38.2M | 14.7G | 334 | 18.85 | 59.86 | 0.9589 | 24.58 | |
| ConvMixer | 2000 epoch | 3.9M | 5.5G | 658 | 22.30 | 67.37 | 0.9507 | 23.73 | |
| Poolformer | 2000 epoch | 37.1M | 14.1G | 341 | 20.96 | 64.31 | 0.9539 | 24.15 | |
| ConvNeXt | 2000 epoch | 37.3M | 14.1G | 344 | 17.58 | 55.76 | 0.9617 | 25.06 | |
| VAN | 2000 epoch | 44.5M | 16.0G | 288 | 16.21 | 53.57 | 0.9646 | 25.49 | |
| HorNet | 2000 epoch | 45.7M | 16.3G | 287 | 17.40 | 55.70 | 0.9624 | 25.14 | |
| MogaNet | 2000 epoch | 46.8M | 16.5G | 255 | 15.67 | 51.84 | 0.9661 | 25.70 | |
| TAU | 2000 epoch | 44.7M | 16.0G | 283 | 15.69 | 51.46 | 0.9661 | 25.71 | |
## Moving FMNIST Benchmarks

Similar to Moving MNIST, we also provide benchmark results on Moving Fashion-MNIST (MFMNIST), a more challenging variant of the benchmark, using the \(10\rightarrow 10\) frames prediction setting following PredRNN. Metrics (MSE, MAE, SSIM, PSNR) of the best models over three trials are reported. Parameters (M), FLOPs (G), and inference FPS on a single V100 GPU are also reported for all methods. All methods are trained with the Adam optimizer and a OneCycle scheduler on a single GPU.

### STL Benchmarks on MFMNIST

For a fair comparison of different methods, we report the best results once models are trained to convergence. Config files are provided in `configs/mfmnist`.
| Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM-S | 200 epoch | 15.0M | 56.8G | 113 | 28.87 | 113.20 | 0.8793 | 22.07 | |
| ConvLSTM-L | 200 epoch | 33.8M | 127.0G | 50 | 25.51 | 104.85 | 0.8928 | 22.67 | |
| PredNet | 200 epoch | 12.5M | 8.6G | 659 | 185.94 | 318.30 | 0.6713 | 14.83 | |
| PhyDNet | 200 epoch | 3.1M | 15.3G | 182 | 34.75 | 125.66 | 0.8567 | 22.03 | |
| PredRNN | 200 epoch | 23.8M | 116.0G | 54 | 22.01 | 91.74 | 0.9091 | 23.42 | |
| PredRNN++ | 200 epoch | 38.6M | 171.7G | 38 | 21.71 | 91.97 | 0.9097 | 23.45 | |
| MIM | 200 epoch | 38.0M | 179.2G | 37 | 23.09 | 96.37 | 0.9043 | 23.13 | |
| MAU | 200 epoch | 4.5M | 17.8G | 201 | 26.56 | 104.39 | 0.8916 | 22.51 | |
| E3D-LSTM | 200 epoch | 51.0M | 298.9G | 18 | 35.35 | 110.09 | 0.8722 | 21.27 | |
| PredRNN.V2 | 200 epoch | 23.9M | 116.6G | 52 | 24.13 | 97.46 | 0.9004 | 22.96 | |
| DMVFN | 200 epoch | 3.5M | 0.2G | 1145 | 118.32 | 220.02 | 0.7572 | 16.76 | |
| SimVP+IncepU | 200 epoch | 58.0M | 19.4G | 209 | 30.77 | 113.94 | 0.8740 | 21.81 | |
| SimVP+gSTA-S | 200 epoch | 46.8M | 16.5G | 282 | 25.86 | 101.22 | 0.8933 | 22.61 | |
| TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.24 | 96.72 | 0.8995 | 22.87 | |
### Benchmark of MetaFormers Based on SimVP (MetaVP)

Since the hidden translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 200-epoch training. Config files are provided in `configs/mfmnist/simvp`.
| MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
|---|---|---|---|---|---|---|---|---|---|
| IncepU (SimVPv1) | 200 epoch | 58.0M | 19.4G | 209 | 30.77 | 113.94 | 0.8740 | 21.81 | |
| gSTA (SimVPv2) | 200 epoch | 46.8M | 16.5G | 282 | 25.86 | 101.22 | 0.8933 | 22.61 | |
| ViT | 200 epoch | 46.1M | 16.9G | 290 | 31.05 | 115.59 | 0.8712 | 21.83 | |
| Swin Transformer | 200 epoch | 46.1M | 16.4G | 294 | 28.66 | 108.93 | 0.8815 | 22.08 | |
| Uniformer | 200 epoch | 44.8M | 16.5G | 296 | 29.56 | 111.72 | 0.8779 | 21.97 | |
| MLP-Mixer | 200 epoch | 38.2M | 14.7G | 334 | 28.83 | 109.51 | 0.8803 | 22.01 | |
| ConvMixer | 200 epoch | 3.9M | 5.5G | 658 | 31.21 | 115.74 | 0.8709 | 21.71 | |
| Poolformer | 200 epoch | 37.1M | 14.1G | 341 | 30.02 | 113.07 | 0.8750 | 21.95 | |
| ConvNeXt | 200 epoch | 37.3M | 14.1G | 344 | 26.41 | 102.56 | 0.8908 | 22.49 | |
| VAN | 200 epoch | 44.5M | 16.0G | 288 | 31.39 | 116.28 | 0.8703 | 22.82 | |
| HorNet | 200 epoch | 45.7M | 16.3G | 287 | 29.19 | 110.17 | 0.8796 | 22.03 | |
| MogaNet | 200 epoch | 46.8M | 16.5G | 255 | 25.14 | 99.69 | 0.8960 | 22.73 | |
| TAU | 200 epoch | 44.7M | 16.0G | 283 | 24.24 | 96.72 | 0.8995 | 22.87 | |
## Moving MNIST-CIFAR Benchmarks

Similar to Moving MNIST, we further design MMNIST-CIFAR, a more challenging variant of Moving MNIST with complex backgrounds from CIFAR-10, using the \(10\rightarrow 10\) frames prediction setting following PredRNN. Metrics (MSE, MAE, SSIM, PSNR) of the best models over three trials are reported. Parameters (M), FLOPs (G), and inference FPS on a single V100 GPU are also reported for all methods. All methods are trained with the Adam optimizer and a OneCycle scheduler on a single GPU.

### STL Benchmarks on MMNIST-CIFAR

For a fair comparison of different methods, we report the best results once models are trained to convergence. Config files are provided in `configs/mmnist_cifar`.
| Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM-S | 200 epoch | 15.5M | 58.8G | 113 | 73.31 | 338.56 | 0.9204 | 23.09 | |
| ConvLSTM-L | 200 epoch | 34.4M | 130.0G | 50 | 62.86 | 291.05 | 0.9337 | 23.83 | |
| PredNet | 200 epoch | 12.5M | 8.6G | 945 | 286.70 | 514.14 | 0.8139 | 17.49 | |
| PhyDNet | 200 epoch | 3.1M | 15.3G | 182 | 142.54 | 700.37 | 0.8276 | 19.92 | |
| PredRNN | 200 epoch | 23.8M | 116.0G | 54 | 50.09 | 225.04 | 0.9499 | 24.90 | |
| PredRNN++ | 200 epoch | 38.6M | 171.7G | 38 | 44.19 | 198.27 | 0.9567 | 25.60 | |
| MIM | 200 epoch | 38.8M | 183.0G | 37 | 48.63 | 213.44 | 0.9521 | 25.08 | |
| MAU | 200 epoch | 4.5M | 17.8G | 201 | 58.84 | 255.76 | 0.9408 | 24.19 | |
| E3D-LSTM | 200 epoch | 52.8M | 306.0G | 18 | 80.79 | 214.86 | 0.9314 | 22.89 | |
| PredRNN.V2 | 200 epoch | 23.9M | 116.6G | 52 | 57.27 | 252.29 | 0.9419 | 24.24 | |
| DMVFN | 200 epoch | 3.6M | 0.2G | 960 | 298.73 | 606.92 | 0.7765 | 17.07 | |
| SimVP+IncepU | 200 epoch | 58.0M | 19.4G | 209 | 59.83 | 214.54 | 0.9414 | 24.15 | |
| SimVP+gSTA-S | 200 epoch | 46.8M | 16.5G | 282 | 51.13 | 185.13 | 0.9512 | 24.93 | |
| TAU | 200 epoch | 44.7M | 16.0G | 275 | 48.17 | 177.35 | 0.9539 | 25.21 | |
### Benchmark of MetaFormers Based on SimVP (MetaVP)

Since the hidden translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 200-epoch training. Config files are provided in `configs/mmnist_cifar/simvp`.
| MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | Download |
|---|---|---|---|---|---|---|---|---|---|
| IncepU (SimVPv1) | 200 epoch | 58.0M | 19.4G | 209 | 59.83 | 214.54 | 0.9414 | 24.15 | |
| gSTA (SimVPv2) | 200 epoch | 46.8M | 16.5G | 282 | 51.13 | 185.13 | 0.9512 | 24.93 | |
| ViT | 200 epoch | 46.1M | 16.9G | 290 | 64.94 | 234.01 | 0.9354 | 23.90 | |
| Swin Transformer | 200 epoch | 46.1M | 16.4G | 294 | 57.11 | 207.45 | 0.9443 | 24.34 | |
| Uniformer | 200 epoch | 44.8M | 16.5G | 296 | 56.96 | 207.51 | 0.9442 | 24.38 | |
| MLP-Mixer | 200 epoch | 38.2M | 14.7G | 334 | 57.03 | 206.46 | 0.9446 | 24.34 | |
| ConvMixer | 200 epoch | 3.9M | 5.5G | 658 | 59.29 | 219.76 | 0.9403 | 24.17 | |
| Poolformer | 200 epoch | 37.1M | 14.1G | 341 | 60.98 | 219.50 | 0.9399 | 24.16 | |
| ConvNeXt | 200 epoch | 37.3M | 14.1G | 344 | 51.39 | 187.17 | 0.9503 | 24.89 | |
| VAN | 200 epoch | 44.5M | 16.0G | 288 | 59.59 | 221.32 | 0.9398 | 25.20 | |
| HorNet | 200 epoch | 45.7M | 16.3G | 287 | 55.79 | 202.73 | 0.9456 | 24.49 | |
| MogaNet | 200 epoch | 46.8M | 16.5G | 255 | 49.48 | 184.11 | 0.9521 | 25.07 | |
| TAU | 200 epoch | 44.7M | 16.0G | 275 | 48.17 | 177.35 | 0.9539 | 25.21 | |
## KittiCaltech Benchmarks

We provide benchmark results on the KittiCaltech Pedestrian dataset using the \(10\rightarrow 1\) frame prediction setting following PredNet. Metrics (MSE, MAE, SSIM, PSNR, LPIPS) of the best models over three trials are reported. Parameters (M), FLOPs (G), and inference FPS on a single V100 GPU are also reported for all methods. The default setup trains for 100 epochs with the Adam optimizer and a OneCycle scheduler on a single GPU, while some computationally expensive methods (denoted by *) use 4 GPUs.

### STL Benchmarks on KittiCaltech

For a fair comparison of different methods, we report the best results once models are trained to convergence. Config files are provided in `configs/kitticaltech`.
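For readers reproducing this setting, the \(10\rightarrow 1\) clip construction can be sketched as follows. The stride and overlap conventions are assumptions for illustration; the PredNet-style evaluation protocol may slice sequences differently.

```python
import numpy as np

def make_clips(video, t_in=10, t_out=1, stride=1):
    """Slice a long video of shape (T, H, W) or (T, H, W, C) into
    (input, target) pairs for a t_in -> t_out prediction setting.
    Stride/overlap choices here are illustrative assumptions."""
    xs, ys = [], []
    for s in range(0, len(video) - t_in - t_out + 1, stride):
        xs.append(video[s : s + t_in])
        ys.append(video[s + t_in : s + t_in + t_out])
    return np.stack(xs), np.stack(ys)

# A 30-frame toy video yields 20 overlapping (10 -> 1) clips.
video = np.zeros((30, 32, 32))
x, y = make_clips(video)
```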
| Method | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
|---|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM-S | 100 epoch | 15.0M | 595.0G | 33 | 139.6 | 1583.3 | 0.9345 | 27.46 | 0.08575 | |
| E3D-LSTM* | 100 epoch | 54.9M | 1004G | 10 | 200.6 | 1946.2 | 0.9047 | 25.45 | 0.12602 | |
| PredNet | 100 epoch | 12.5M | 42.8G | 94 | 159.8 | 1568.9 | 0.9286 | 27.21 | 0.11289 | |
| PhyDNet | 100 epoch | 3.1M | 40.4G | 117 | 312.2 | 2754.8 | 0.8615 | 23.26 | 0.32194 | |
| MAU | 100 epoch | 24.3M | 172.0G | 16 | 177.8 | 1800.4 | 0.9176 | 26.14 | 0.09673 | |
| MIM | 100 epoch | 49.2M | 1858G | 39 | 125.1 | 1464.0 | 0.9409 | 28.10 | 0.06353 | |
| PredRNN | 100 epoch | 23.7M | 1216G | 17 | 130.4 | 1525.5 | 0.9374 | 27.81 | 0.07395 | |
| PredRNN++ | 100 epoch | 38.5M | 1803G | 12 | 125.5 | 1453.2 | 0.9433 | 28.02 | 0.13210 | |
| PredRNN.V2 | 100 epoch | 23.8M | 1223G | 52 | 147.8 | 1610.5 | 0.9330 | 27.12 | 0.08920 | |
| DMVFN | 100 epoch | 3.6M | 1.2G | 557 | 183.9 | 1531.1 | 0.9314 | 26.95 | 0.04942 | |
| SimVP+IncepU | 100 epoch | 8.6M | 60.6G | 57 | 160.2 | 1690.8 | 0.9338 | 26.81 | 0.06755 | |
| SimVP+gSTA-S | 100 epoch | 15.6M | 96.3G | 40 | 129.7 | 1507.7 | 0.9454 | 27.89 | 0.05736 | |
| TAU | 100 epoch | 44.7M | 80.0G | 55 | 131.1 | 1507.8 | 0.9456 | 27.83 | 0.05494 | |
### Benchmark of MetaFormers Based on SimVP (MetaVP)

Since the hidden translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 100-epoch training. Config files are provided in `configs/kitticaltech/simvp`.
| MetaFormer | Setting | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
|---|---|---|---|---|---|---|---|---|---|---|
| IncepU (SimVPv1) | 100 epoch | 8.6M | 60.6G | 57 | 160.2 | 1690.8 | 0.9338 | 26.81 | 0.06755 | |
| gSTA (SimVPv2) | 100 epoch | 15.6M | 96.3G | 40 | 129.7 | 1507.7 | 0.9454 | 27.89 | 0.05736 | |
| ViT* | 100 epoch | 12.7M | 155.0G | 25 | 146.4 | 1615.8 | 0.9379 | 27.43 | 0.06659 | |
| Swin Transformer | 100 epoch | 15.3M | 95.2G | 49 | 155.2 | 1588.9 | 0.9299 | 27.25 | 0.08113 | |
| Uniformer* | 100 epoch | 11.8M | 104.0G | 28 | 135.9 | 1534.2 | 0.9393 | 27.66 | 0.06867 | |
| MLP-Mixer | 100 epoch | 22.2M | 83.5G | 60 | 207.9 | 1835.9 | 0.9133 | 26.29 | 0.07750 | |
| ConvMixer | 100 epoch | 1.5M | 23.1G | 129 | 174.7 | 1854.3 | 0.9232 | 26.23 | 0.07758 | |
| Poolformer | 100 epoch | 12.4M | 79.8G | 51 | 153.4 | 1613.5 | 0.9334 | 27.38 | 0.07000 | |
| ConvNeXt | 100 epoch | 12.5M | 80.2G | 54 | 146.8 | 1630.0 | 0.9336 | 27.19 | 0.06987 | |
| VAN | 100 epoch | 14.9M | 92.5G | 41 | 127.5 | 1476.5 | 0.9462 | 27.98 | 0.05500 | |
| HorNet | 100 epoch | 15.3M | 94.4G | 43 | 152.8 | 1637.9 | 0.9365 | 27.09 | 0.06004 | |
| MogaNet | 100 epoch | 15.6M | 96.2G | 36 | 131.4 | 1512.1 | 0.9442 | 27.79 | 0.05394 | |
| TAU | 100 epoch | 44.7M | 80.0G | 55 | 131.1 | 1507.8 | 0.9456 | 27.83 | 0.05494 | |
## KTH Benchmarks

We provide long-term prediction benchmark results on the KTH Action dataset using the \(10\rightarrow 20\) frames prediction setting. Metrics (MSE, MAE, SSIM, PSNR, LPIPS) of the best models over three trials are reported. Parameters (M), FLOPs (G), and inference FPS on a single V100 GPU are also reported for all methods. The default setup trains for 100 epochs with the Adam optimizer, a batch size of 16, and a OneCycle scheduler on one or four GPUs; we report the GPU setup used for each method (also shown in the config).

### STL Benchmarks on KTH

For a fair comparison of different methods, we report the best results once models are trained to convergence. Config files are provided in `configs/kth`. Note that `4xbs4` denotes 4-GPU DDP training with a batch size of 4 per GPU.
| Method | Setting | GPUs | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM-S | 100 epoch | 1xbs16 | 14.9M | 1368.0G | 16 | 47.65 | 445.5 | 0.8977 | 26.99 | 0.26686 | |
| E3D-LSTM | 100 epoch | 2xbs8 | 53.5M | 217.0G | 17 | 136.40 | 892.7 | 0.8153 | 21.78 | 0.48358 | |
| PredNet | 100 epoch | 1xbs16 | 12.5M | 3.4G | 399 | 152.11 | 783.1 | 0.8094 | 22.45 | 0.32159 | |
| PhyDNet | 100 epoch | 1xbs16 | 3.1M | 93.6G | 58 | 91.12 | 765.6 | 0.8322 | 23.41 | 0.50155 | |
| MAU | 100 epoch | 1xbs16 | 20.1M | 399.0G | 8 | 51.02 | 471.2 | 0.8945 | 26.73 | 0.25442 | |
| MIM | 100 epoch | 1xbs16 | 39.8M | 1099.0G | 17 | 40.73 | 380.8 | 0.9025 | 27.78 | 0.18808 | |
| PredRNN | 100 epoch | 1xbs16 | 23.6M | 2800.0G | 7 | 41.07 | 380.6 | 0.9097 | 27.95 | 0.21892 | |
| PredRNN++ | 100 epoch | 1xbs16 | 38.3M | 4162.0G | 5 | 39.84 | 370.4 | 0.9124 | 28.13 | 0.19871 | |
| PredRNN.V2 | 100 epoch | 1xbs16 | 23.6M | 2815.0G | 7 | 39.57 | 368.8 | 0.9099 | 28.01 | 0.21478 | |
| DMVFN | 100 epoch | 1xbs16 | 3.5M | 0.88G | 727 | 59.61 | 413.2 | 0.8976 | 26.65 | 0.12842 | |
| SimVP+IncepU | 100 epoch | 2xbs8 | 12.2M | 62.8G | 77 | 41.11 | 397.1 | 0.9065 | 27.46 | 0.26496 | |
| SimVP+gSTA-S | 100 epoch | 4xbs4 | 15.6M | 76.8G | 53 | 45.02 | 417.8 | 0.9049 | 27.04 | 0.25240 | |
| TAU | 100 epoch | 4xbs4 | 15.0M | 73.8G | 55 | 45.32 | 421.7 | 0.9086 | 27.10 | 0.22856 | |
### Benchmark of MetaFormers Based on SimVP (MetaVP)

Since the hidden translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 100-epoch training. Config files are provided in `configs/kth/simvp`.
| MetaFormer | Setting | GPUs | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IncepU (SimVPv1) | 100 epoch | 2xbs8 | 12.2M | 62.8G | 77 | 41.11 | 397.1 | 0.9065 | 27.46 | 0.26496 | |
| gSTA (SimVPv2) | 100 epoch | 2xbs8 | 15.6M | 76.8G | 53 | 45.02 | 417.8 | 0.9049 | 27.04 | 0.25240 | |
| ViT | 100 epoch | 2xbs8 | 12.7M | 112.0G | 28 | 56.57 | 459.3 | 0.8947 | 26.19 | 0.27494 | |
| Swin Transformer | 100 epoch | 2xbs8 | 15.3M | 75.9G | 65 | 45.72 | 405.7 | 0.9039 | 27.01 | 0.25178 | |
| Uniformer | 100 epoch | 2xbs8 | 11.8M | 78.3G | 43 | 44.71 | 404.6 | 0.9058 | 27.16 | 0.24174 | |
| MLP-Mixer | 100 epoch | 2xbs8 | 20.3M | 66.6G | 34 | 57.74 | 517.4 | 0.8886 | 25.72 | 0.28799 | |
| ConvMixer | 100 epoch | 2xbs8 | 1.5M | 18.3G | 175 | 47.31 | 446.1 | 0.8993 | 26.66 | 0.28149 | |
| Poolformer | 100 epoch | 2xbs8 | 12.4M | 63.6G | 67 | 45.44 | 400.9 | 0.9065 | 27.22 | 0.24763 | |
| ConvNeXt | 100 epoch | 2xbs8 | 12.5M | 63.9G | 72 | 45.48 | 428.3 | 0.9037 | 26.96 | 0.26253 | |
| VAN | 100 epoch | 2xbs8 | 14.9M | 73.8G | 55 | 45.05 | 409.1 | 0.9074 | 27.07 | 0.23116 | |
| HorNet | 100 epoch | 2xbs8 | 15.3M | 75.3G | 58 | 46.84 | 421.2 | 0.9005 | 26.80 | 0.26921 | |
| MogaNet | 100 epoch | 2xbs8 | 15.6M | 76.7G | 48 | 42.98 | 418.7 | 0.9065 | 27.16 | 0.25146 | |
| TAU | 100 epoch | 2xbs8 | 15.0M | 73.8G | 55 | 45.32 | 421.7 | 0.9086 | 27.10 | 0.22856 | |
## Human 3.6M Benchmarks

We further provide high-resolution benchmark results on the Human3.6M dataset using the \(4\rightarrow 4\) frames prediction setting. Metrics (MSE, MAE, SSIM, PSNR, LPIPS) of the best models over three trials are reported. We use a 256x256 resolution, similar to STRPM. Parameters (M), FLOPs (G), and inference FPS on a single V100 GPU are also reported for all methods. The default setup trains for 50 epochs with the Adam optimizer, a batch size of 16, and a Cosine scheduler (no warm-up) on one or four GPUs; we report the GPU setup used for each method (also shown in the config).

### STL Benchmarks on Human 3.6M

For a fair comparison of different methods, we report the best results once models are trained to convergence. Config files are provided in `configs/human`.
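The warm-up-free Cosine schedule used for this benchmark can be written in a few lines; the base and minimum learning rates below are placeholder assumptions, not the benchmark's actual values.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Plain cosine decay with no warm-up: starts at base_lr and
    decays smoothly to min_lr over total_steps."""
    t = min(step, total_steps) / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# Starts at base_lr, reaches half of base_lr midway, ends at min_lr.
lr0, lr_mid, lr_end = cosine_lr(0, 100), cosine_lr(50, 100), cosine_lr(100, 100)
```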
| Method | Setting | GPUs | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM-S | 50 epoch | 1xbs16 | 15.5M | 347.0G | 52 | 125.5 | 1566.7 | 0.9813 | 33.40 | 0.03557 | |
| E3D-LSTM | 50 epoch | 4xbs4 | 60.9M | 542.0G | 7 | 143.3 | 1442.5 | 0.9803 | 32.52 | 0.04133 | |
| PredNet | 50 epoch | 1xbs16 | 12.5M | 13.7G | 176 | 261.9 | 1625.3 | 0.9786 | 31.76 | 0.03264 | |
| PhyDNet | 50 epoch | 1xbs16 | 4.2M | 19.1G | 57 | 125.7 | 1614.7 | 0.9804 | 39.84 | 0.03709 | |
| MAU | 50 epoch | 1xbs16 | 20.2M | 105.0G | 6 | 127.3 | 1577.0 | 0.9812 | 33.33 | 0.03561 | |
| MIM | 50 epoch | 4xbs4 | 47.6M | 1051.0G | 17 | 112.1 | 1467.1 | 0.9829 | 33.97 | 0.03338 | |
| PredRNN | 50 epoch | 1xbs16 | 24.6M | 704.0G | 25 | 113.2 | 1458.3 | 0.9831 | 33.94 | 0.03245 | |
| PredRNN++ | 50 epoch | 1xbs16 | 39.3M | 1033.0G | 18 | 110.0 | 1452.2 | 0.9832 | 34.02 | 0.03196 | |
| PredRNN.V2 | 50 epoch | 1xbs16 | 24.6M | 708.0G | 24 | 114.9 | 1484.7 | 0.9827 | 33.84 | 0.03334 | |
| SimVP+IncepU | 50 epoch | 1xbs16 | 41.2M | 197.0G | 26 | 115.8 | 1511.5 | 0.9822 | 33.73 | 0.03467 | |
| SimVP+gSTA-S | 50 epoch | 1xbs16 | 11.3M | 74.6G | 52 | 108.4 | 1441.0 | 0.9834 | 34.08 | 0.03224 | |
| TAU | 50 epoch | 1xbs16 | 37.6M | 182.0G | 26 | 113.3 | 1390.7 | 0.9839 | 34.03 | 0.02783 | |
### Benchmark of MetaFormers Based on SimVP (MetaVP)

Since the hidden translator in SimVP can be replaced by any MetaFormer block that performs token mixing and channel mixing, we benchmark popular MetaFormer architectures on SimVP with 50-epoch training. Config files are provided in `configs/human/simvp`.
| MetaFormer | Setting | GPUs | Params | FLOPs | FPS | MSE | MAE | SSIM | PSNR | LPIPS | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IncepU (SimVPv1) | 50 epoch | 1xbs16 | 41.2M | 197.0G | 26 | 115.8 | 1511.5 | 0.9822 | 33.73 | 0.03467 | |
| gSTA (SimVPv2) | 50 epoch | 1xbs16 | 11.3M | 74.6G | 52 | 108.4 | 1441.0 | 0.9834 | 34.08 | 0.03224 | |
| ViT | 50 epoch | 4xbs4 | 28.3M | 239.0G | 17 | 136.3 | 1603.5 | 0.9796 | 33.10 | 0.03729 | |
| Swin Transformer | 50 epoch | 1xbs16 | 38.8M | 188.0G | 28 | 133.2 | 1599.7 | 0.9799 | 33.16 | 0.03766 | |
| Uniformer | 50 epoch | 4xbs4 | 27.7M | 211.0G | 14 | 116.3 | 1497.7 | 0.9824 | 33.76 | 0.03385 | |
| MLP-Mixer | 50 epoch | 1xbs16 | 47.0M | 164.0G | 34 | 125.7 | 1511.9 | 0.9819 | 33.49 | 0.03417 | |
| ConvMixer | 50 epoch | 1xbs16 | 3.1M | 39.4G | 84 | 115.8 | 1527.4 | 0.9822 | 33.67 | 0.03436 | |
| Poolformer | 50 epoch | 1xbs16 | 31.2M | 156.0G | 30 | 118.4 | 1484.1 | 0.9827 | 33.78 | 0.03313 | |
| ConvNeXt | 50 epoch | 1xbs16 | 31.4M | 157.0G | 33 | 113.4 | 1469.7 | 0.9828 | 33.86 | 0.03305 | |
| VAN | 50 epoch | 1xbs16 | 37.5M | 182.0G | 24 | 111.4 | 1454.5 | 0.9831 | 33.93 | 0.03335 | |
| HorNet | 50 epoch | 1xbs16 | 28.1M | 143.0G | 33 | 118.1 | 1481.1 | 0.9824 | 33.73 | 0.03333 | |
| MogaNet | 50 epoch | 1xbs16 | 8.6M | 63.6G | 56 | 109.1 | 1446.4 | 0.9834 | 34.05 | 0.03163 | |
| TAU | 50 epoch | 1xbs16 | 37.6M | 182.0G | 26 | 113.3 | 1390.7 | 0.9839 | 34.03 | 0.02783 | |