Apr 01 2024 02:13 PM - edited Apr 01 2024 02:26 PM
Hey all - I am running a regression analysis in excel for 30 data points and I am getting a #NUM! error in my p-value. what is really odd is if I exclude the last row and only include 29 data points, it works just fine. Any ideas on how I can resolve this error?
edit: link to OneDrive with file in there Regression
Thanks!
Apr 01 2024 02:21 PM - edited Apr 01 2024 02:24 PM
@MikeJames630 wrote: "Any ideas on how I can resolve this error?"
You can start by providing an Excel file that demonstrates the problem. The Excel file should have all the original data and formulas. Of course, remove any personal information, like names.
If you cannot attach the Excel file using the "browse files" link below the response area (someone else said they could not), upload the Excel file to a file-sharing website such as box.net/files, dropbox.com, onedrive.live.com, etc. (Not Google Sheets.) Then post a link to the file that allows us to download it without logging in. It would be prudent to ensure that the link is "view only", not "allow editing".
And you cannot post an actual link (URL) (sigh), edit the link by removing the first few periods and slash. For example, the "link" to this respsonse is techcommunity microsoft com /t5/excel/num-error-in-regression-analysis-p-value/m-p/4102577/highlight/false#M225503. (wink)
Apr 01 2024 02:27 PM
Apr 02 2024 03:24 PM
With your data, both t-stat and p-value should display a #DIV/0 error for both the x and intercept coefficients because for all x-values, the original y-values in Q6:Q35 exactly equals the estimated y-values, which is calculated by the regression formula estY = m*x + b.
Apparently, the Data Analysis Regression tool arbitrarily replaces those errors with 65535 for t-stat and #NUM for p-value.
We know that 65535 is arbitrary because if it were the actual t-stat, the p-value would be TDIST(65535, 30-2, 2), which returns 3.7449E-116, not an Excel error.
(Also, 65535 is the largest unsigned 16-bit integer. That's a suspicious "coincidence".)
The p-value calculation returns a #DIV/0 error because it references the t-stat result, which is #DIV/0. The p-value is calculated by the formula =TDIST(tStat, n-2, 2), where n is the number of observations in B8.
The t-stat result is #DIV/0 because it is calculated by the formula =coeff / stdErr, and the std err is zero.
The std err is zero because it is calculated, in part, by the formula
=SQRT(SUMPRODUCT((Q6:Q35 - R6:R35)^2)*.../....
The SUMPRODUCT is the sum of the squared difference (SSE) between the original y-values in Q6:Q35 and the estimated y-values estY = m*x + b, which I calculate in R6:R35.
In your case, with m=-1, b=0 and x = -y, estY equals y for all values of x. Consequently, the SSE is zero.
(Aside.... In "dynamic-array aware" versions of Excel, SUMPRODUCT can be replaced with SUM, which might be clearer.)
-----
Just for grins, we can "correct" this behavior with a very small change to just one y-value.
For example, if the first y-value is 0.0004897000001 instead of 0.0004897, but the first x-value continues to be -0.0004897, both the t-stat and p-value are computable because the std err is no longer zero.
Note: I am __not__ suggesting such "corrections" as a work-around. It is merely a "proof of concept".
TMI.... With the y-value change, the resultign zero p-value for the x coefficient is due to numerical limitations of 64-bit binary floating-point, which is how numbers are represented internally. If the t-stat were 4.80E+11, the p-value would be 2.291E-308, which is very close to the smallest possible number (2.225E-308). Since the absolute value of the t-stat for the x coefficient is much larger (6.668E+12), the p-value calculation "underflows" and results in zero.