I was searching a large emissions database for pairs of columns with correlated values.
My first effort identified a large number of sets with an R2 value of 1 based on blindly applying the REGR_R2 function to discrete pairs of numeric data columns.
When I started reviewing outputs I noticed that many of the REGR_R2 results were not reliable because REGR_R2 appears to return a value of 1 when the y column is constant (i.e., y_col) in the syntax REGR_R2(y_col, x_col).
I plan to add a standard deviation calculation to the script prior to the REGR_R2 calculation and forgo the REGR_R2 calculation when the standard deviation of y_col = 0.
But why does REGR_R2 return a value of 1 when the y parameter is constant and the x parameter is variable? Shouldn't it return either 0 or null?
table xy_values
y_val x_val
row 1 0.0038 0.023
row 2 0.0038 0.022
row 3 0.0038 0.085
row 4 0.0038 0.087
row 5 0.0038 0.019
select regr_r2(xy_values.y_val, xy_values.x_val)
from xy_values;
yields
REGR_R2(Y_VAL,X_VAL) = 1
Thanks,
Stephen