Undoubtedly, a subjective study conducted in a well-controlled laboratory environment with a large panel of observers is the gold standard for comparing image and video processing algorithms. Unfortunately, crowdsourced studies do not allow researchers to control the viewing conditions, since participants watch the content at home on a wide variety of devices. However, it is significantly easier to recruit a large number of participants for a crowdsourced study than to bring the same number of people into a laboratory. Thus, the noise introduced by the poorly controlled viewing conditions of a crowdsourced study can be compensated for by a larger number of participants.

In this post we show that the results of a study conducted with the Subjectify.us platform are close to the results obtained in a laboratory. For this purpose, we use Subjectify.us to replicate a user study conducted by Netflix in a well-controlled laboratory environment and then compare the results of the two studies. Furthermore, we show that the results obtained with Subjectify.us correlate with the laboratory results significantly better than scores computed by objective metrics (e.g., PSNR, SSIM, and VMAF, which was recently proposed by Netflix).

Dataset

The dataset used in the Netflix subjective study consists of videos of various types (e.g., animation, fast motion, landscape footage) compressed with an H.264 encoder at various bitrates and resolutions. Comparing such content is a challenging task in a crowdsourced setting, since poor viewing conditions can render the barely noticeable differences between high-bitrate videos completely invisible.

The public part of the Netflix dataset consists of 9 test video sequences; for each sequence there are 6-10 distorted videos as well as the original undistorted video. For our study we randomly selected 7 of these video sequences and uploaded both the undistorted and the distorted files to the Subjectify.us platform.

Perceptual Data Collection

The uploaded videos were shown to study participants in a pairwise fashion: the two videos of a pair were displayed in full-screen mode one after another. After each pair, the participant was asked to choose the video with the better visual quality or to indicate that the two videos have equal quality. The participant could also replay the videos.

Each study participant compared 10 pairs of videos, including 2 hidden quality-control comparisons between an original undistorted video and its version compressed at a 375 kbps bitrate. The answers of participants who failed at least one quality-control comparison were rejected. Participants were allowed to complete the questionnaire up to 5 times. In total, we collected 11,235 answers from 375 unique participants. Subjectify.us converted the collected answers into final quality scores using the Crowd Bradley-Terry model (Chen, Bennett, Collins-Thompson, & Horvitz, 2013).
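To give an idea of how pairwise answers turn into scalar scores, below is a minimal sketch of a plain Bradley-Terry maximum-likelihood fit in Python. Note that the Crowd Bradley-Terry model used by Subjectify.us additionally models annotator reliability, so this is only an illustration, and the win counts in the example are hypothetical.

```python
import numpy as np

def bradley_terry(wins, n_iter=100, eps=1e-9):
    """Plain Bradley-Terry fit via the classic MM updates.

    wins[i, j] -- how many times video i was preferred over video j.
    ("Equal quality" answers can be counted as half a win for each video.)
    Returns one score per video; higher means better perceived quality.
    """
    n = wins.shape[0]
    p = np.ones(n)                          # initial strength of every video
    total = wins + wins.T                   # comparisons between each pair
    for _ in range(n_iter):
        for i in range(n):
            num = wins[i].sum()             # total wins of video i
            den = (total[i] / (p[i] + p + eps)).sum()
            p[i] = num / (den + eps)
        p /= p.sum()                        # fix the overall scale
    return np.log(p + eps)                  # log-strengths serve as scores

# Hypothetical win counts for three compressed versions of one sequence.
wins = np.array([[0.0, 14.0,  3.0],
                 [2.0,  0.0,  1.0],
                 [11.0, 17.0, 0.0]])
print(bradley_terry(wins))
```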

Data Analysis

To evaluate the quality of the scores computed by Subjectify.us in the crowdsourced setting, we compute the correlation between these scores and the DMOS scores from the Netflix experiment conducted in the laboratory environment. As a baseline, we also compute the correlation between the DMOS scores and scores estimated by widely used objective quality metrics: PSNR, SSIM (Wang, Bovik, Sheikh, & Simoncelli, 2004), MS-SSIM (Wang, Simoncelli, & Bovik, 2003), VQM (Xiao, 2000), and VMAF.
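As an illustration, both correlation coefficients can be computed with SciPy; the score arrays below are hypothetical placeholders, with one entry per distorted video.

```python
from scipy.stats import pearsonr, spearmanr

dmos_lab = [34.2, 51.7, 63.0, 78.4, 85.1]   # laboratory DMOS scores (hypothetical)
crowd    = [0.12, 0.35, 0.48, 0.71, 0.83]   # Subjectify.us scores (hypothetical)

plcc, _  = pearsonr(dmos_lab, crowd)        # linear (Pearson) correlation
srocc, _ = spearmanr(dmos_lab, crowd)       # rank-order (Spearman) correlation
print(f"PLCC = {plcc:.4f}, SROCC = {srocc:.4f}")
```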

The computed correlation coefficients are shown in the figure below:

Correlation with DMOS scores
Correlation coefficients between DMOS scores collected in the laboratory environment and scores estimated by Subjectify.us, compared with the correlation coefficients of objective quality metrics.

The figure shows that the scores estimated with the Subjectify.us platform have a high correlation with the ground-truth DMOS scores in terms of both the Pearson (0.9614) and Spearman (0.9567) correlation coefficients. Moreover, these coefficients are significantly higher than those achieved by the objective quality metrics. Notably, the VMAF metric was designed by Netflix alongside the dataset used in our study, so it may be over-fitted to this particular dataset. Even so, in our experiment the Subjectify.us platform achieved higher correlation coefficients than VMAF.

To visually evaluate how well the various methods predict the ground-truth DMOS scores, we show them in the scatter plots below:

DMOS vs. Predicted scores
DMOS scores estimated in the laboratory environment vs. scores predicted by Subjectify.us and well-known objective quality metrics. Points corresponding to the same source video sequence share a color.
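For readers who want to reproduce this kind of plot from their own scores, here is a minimal matplotlib sketch; the predicted scores, DMOS values, and sequence labels below are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical data: predicted score, laboratory DMOS, and source-sequence label.
predicted = [0.12, 0.35, 0.48, 0.71, 0.83, 0.22, 0.55]
dmos      = [34.2, 51.7, 63.0, 78.4, 85.1, 40.3, 66.8]
sequence  = ["A", "A", "A", "B", "B", "C", "C"]

# One color per source sequence, as in the figure above.
for seq in sorted(set(sequence)):
    xs = [p for p, s in zip(predicted, sequence) if s == seq]
    ys = [d for d, s in zip(dmos, sequence) if s == seq]
    plt.scatter(xs, ys, label=seq)

plt.xlabel("Predicted score (Subjectify.us or objective metric)")
plt.ylabel("Laboratory DMOS")
plt.legend(title="Source sequence")
plt.show()
```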

Finally, we evaluate the relation between the number of collected responses and the correlation of the estimated scores with the ground-truth DMOS scores. Below we show this relation together with the correlation coefficients achieved by the VMAF, MS-SSIM, and SSIM quality metrics:

Correlation vs. Number of Responses
Correlation coefficients between ground-truth DMOS scores and scores computed by Subjectify.us for various numbers of participants’ responses. The baselines indicate the correlation coefficients achieved by the objective quality metrics VMAF, MS-SSIM, and SSIM.

As expected, the correlation coefficients grow with the number of collected responses. The MS-SSIM, SSIM, VQM, and PSNR metrics are outperformed almost immediately, while VMAF is outperformed once 1000 or more responses are used to compute the scores.
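A rough sketch of how such a curve can be obtained, assuming the raw pairwise answers are available as (winner, loser) index pairs and reusing the `bradley_terry` helper from the earlier sketch; the variable names and the commented usage below are hypothetical.

```python
import random
import numpy as np
from scipy.stats import spearmanr

def correlation_for_subset(answers, n_responses, n_videos, dmos, seed=0):
    """Subsample the pairwise answers and measure how well the resulting
    Bradley-Terry scores agree with the laboratory DMOS scores."""
    subset = random.Random(seed).sample(answers, n_responses)
    wins = np.zeros((n_videos, n_videos))
    for winner, loser in subset:
        wins[winner, loser] += 1
    scores = bradley_terry(wins)        # helper from the earlier sketch
    return spearmanr(scores, dmos).correlation

# Hypothetical usage: answers = [(3, 0), (1, 4), ...]; dmos has one value per video.
# for n in (200, 500, 1000, 2000):
#     print(n, correlation_for_subset(answers, n, len(dmos), dmos))
```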

Conclusion

In this post we showed that quality scores computed by the Subjectify.us platform in the crowdsourced setting are a better alternative for comparing compressed videos than scores estimated by objective quality metrics, since the Subjectify.us scores correlate more strongly with the results of the laboratory experiment. Moreover, the correlation coefficients between the Subjectify.us scores and the ground-truth scores are high in absolute terms (Pearson 0.9614, Spearman 0.9567), indicating that they are very close to the laboratory results.

References

  1. Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013). Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining - WSDM ’13. Association for Computing Machinery (ACM). http://doi.org/10.1145/2433396.2433420
  2. Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612. http://doi.org/10.1109/TIP.2003.819861
  3. Wang, Z., Simoncelli, E. P., & Bovik, A. C. (2003). Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003 (Vol. 2, pp. 1398–1402). http://doi.org/10.1109/ACSSC.2003.1292216
  4. Xiao, F., & others. (2000). DCT-based video quality evaluation. Final Project for EE392J, 769.