Forum Discussion
finding deviations from the mean
- Jul 11, 2022 · @Valentinaampuero: So I got the sample files (May–Nov of 2021). I am attaching the Nov data here. In this file I added a tab (DailyAvg) where I calculate the average for every 15-minute slot for each panel across all the days recorded. I then calculate the average (col LED) and standard deviation (col LEE) for each 15-minute slot, and added conditional formatting to highlight all the spots more than 1 SD below the mean (orange) and more than 2 SD below (red). Finally, at the bottom I did a count of how many times each particular panel was found below 1 SD and below 2 SD on average. I hope this might be helpful and can serve as a template for the other months. Maybe even collate these summary counts across months for an even better indicator of performance across the year.
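For anyone following along outside Excel, the flag-and-count logic in that DailyAvg tab can be sketched in Python. The data below is an invented toy set; the slot/panel layout and dictionary names are assumptions, not the actual workbook:

```python
from statistics import mean, stdev

# Toy daily averages: each key is a 15-minute slot, each list holds the
# per-panel average across days (4 hypothetical panels; panel 3 runs low).
daily_avg = {
    "07:00": [5.0, 5.1, 4.9, 2.0],
    "07:15": [6.0, 6.2, 5.9, 2.5],
}

# Count, per panel, how often it falls more than 1 SD / 2 SD below the
# slot mean -- the same idea as the conditional-formatting counts.
flags = {p: {"below_1sd": 0, "below_2sd": 0} for p in range(4)}
for slot, values in daily_avg.items():
    m, s = mean(values), stdev(values)
    for panel, v in enumerate(values):
        if v < m - 2 * s:
            flags[panel]["below_2sd"] += 1
        elif v < m - s:
            flags[panel]["below_1sd"] += 1
```

With this toy data, panel 3 picks up a "below 1 SD" flag in both slots while the other panels stay clean, which is the per-panel summary count described above.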
@Valentinaampuero That is correct. And if I understand correctly that this data relates to solar power capture, then all of the 0's before 7:00 am in your dataset are also irrelevant: it simply makes sense that you will not capture solar energy when it is dark outside. Still, even without the impact of these 0's, yes, my median approach is more stable and does not rely on normality. Hope this helps.
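A minimal sketch of why the median holds up better than the mean here, assuming toy readings for one slot with pre-dawn zeros and one extreme value (all numbers invented):

```python
from statistics import mean, median

# Toy readings for one slot across days: pre-dawn zeros plus one
# extreme spike.
readings = [0, 0, 5.0, 5.1, 4.9, 50.0]
nonzero = [r for r in readings if r > 0]  # drop the dark-hours zeros

slot_mean = mean(nonzero)      # dragged far upward by the 50.0 outlier
slot_median = median(nonzero)  # stays near the typical panel output
```

Here the mean lands around 16.25 while the median stays at 5.05, right where most panels actually sit, with no normality assumption needed.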
That all said, I think this is getting stuck in the nuances: the mean and SD in this case were used as a sample indicator, not a statistical measure. By that I mean I am not using the SD to prove or disprove a hypothesis. I encourage @Valentinaampuero to use the template I created with any threshold they wish. I picked one, but they could use a 5th and 10th percentile rank, or 10% and 20% of max, or the median and something... Unless they have or generate additional data about what a true indicator of a bad panel looks like, these numbers and measures are just thumbs in the wind.
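For illustration, swapping in one of the alternative thresholds mentioned above (a percent-of-max cutoff instead of mean − k·SD) is a one-line change; the values here are invented:

```python
# Toy panel outputs for one slot; the thresholds are the percent-of-max
# cutoffs floated above, purely for illustration.
values = [5.0, 5.1, 4.9, 5.2, 0.9, 0.4]
mx = max(values)
below_10 = [v for v in values if v < 0.10 * mx]  # under 10% of max
below_20 = [v for v in values if v < 0.20 * mx]  # under 20% of max
```

Any of these cutoffs plugs into the same highlight-and-count template; without ground truth about what a bad panel looks like, the choice is a judgment call.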
I'm sure your template is great and I hope it helps them. I think the 'hurdle' I got them over was how to average the week's worth of data per time slot and per panel.
@Valentinaampuero, maybe you can 'ping' cool2021 with some questions or tips on statistical details that might help.
- cool2021 · Aug 02, 2022 (Iron Contributor): @mtarler, the attached workbook has another run of a sample of photovoltaic plant data through my box-plot outliers time-series algorithm. From the original dataset, I took a sample of 45 solar panels and ran the 15-minute-interval capture data through my algorithm, which identified a number of time-series segments. The final column on the 'Data' tab gives the number of panels (maximum of 45) during a specific 15-minute interval that had a 'Low' photovoltaic capture as classified by the algorithm. On the 'Pivot' tab, a pivot table summarizes, within each hour of the day, the % of panels with 'Low' capture. For example, between 8 and 9 am there were 4 readings (four 15-minute intervals), and you can see that one of those readings indicated that all 45 panels in this sample were catching 'Low' photovoltaic rays. Between 10 am and 4 pm, 100% of the panels captured the optimal level across all of the 15-minute intervals. But after 4 pm you start to see a higher percentage of readings where at least one of the 45 panels is 'Low'. And, incidentally, even if you transpose the raw data so the time intervals are columns and panels are rows, you will still find a similar result.
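The two steps described above can be sketched as follows, under the assumption that 'Low' means below the standard box-plot lower fence (Q1 − 1.5·IQR); the readings and counts are toy numbers, not the attached workbook's:

```python
from statistics import quantiles

# Step 1: classify 'Low' panels within one 15-minute interval, assuming
# 'Low' = below the box-plot lower fence (toy readings, 8 of 45 panels).
interval = [5.0, 5.2, 4.9, 5.1, 1.0, 5.3, 5.0, 4.8]
q1, _, q3 = quantiles(interval, n=4)      # quartiles (exclusive method)
low_fence = q1 - 1.5 * (q3 - q1)          # standard lower fence
low_count = sum(1 for v in interval if v < low_fence)

# Step 2: the hourly 'Pivot'-style rollup -- % of Low panel-readings in
# an hour, with four 15-minute intervals of 45 panels each (toy counts).
low_per_interval = {"08:00": 45, "08:15": 0, "08:30": 0, "08:45": 0}
pct_low = sum(low_per_interval.values()) / (4 * 45) * 100
```

In the toy interval only the 1.0 reading falls below the fence, and the hourly rollup reproduces the "one interval out of four was all-Low" shape described for the 8–9 am hour.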
- cool2021 · Jul 13, 2022 (Iron Contributor): @mtarler, one other point: Microsoft's colour-scaling criteria are not based on any test of statistical significance; my process is: https://support.microsoft.com/en-us/office/highlight-patterns-and-trends-with-conditional-formatting-eea152f5-2a7d-4c1a-a2da-c5f893adb621 So you may see patterns and trends using the Microsoft scales, but you will not identify statistically significant (important) events that are occurring. My process will, regardless of the data's scale or type ($, #, %), every single time. That is the key difference. When you see any shading using my process, it is statistically significant: always. Thanks
- cool2021 · Jul 13, 2022 (Iron Contributor): Can't one panel, or a row of solar panels, be adjusted to angle more to catch optimal light? Don't know. Repeated, statistically low solar capture is what analysts are looking for, along with any ways the panels can be adjusted to improve capture. One outlier, two outliers... etc. ... is not the point. Patterns of outliers (low or high) is, whether that pattern is across solar panels at a point in time or within a solar panel repeatedly during the day: peak periods when sunlight is strongest, for example. Patterns. Not points. THAT is the point. And that seems to be the question this poster asked: how can I find panels (rows or individual) that are not capturing optimal sunlight during periods of peak sunlight?
- mtarler · Jul 13, 2022 (Silver Contributor): @cool2021, your template tool is really great and can be really useful in particular situations; I just don't know if this is one of them. I don't know what you mean by "more robust results". If by that you mean it will show more outlier data points, then maybe, but I would contend that makes for "less robust results". I don't think they want to be distracted by needing to investigate every time a cloud hovered over 1 panel on 1 day. The whole point of averaging, filtering, and many other data techniques is to try to parse out the relevant and important information from the noise and the unimportant information. If your tool highlights all the noise together with the actual bad data, is that really "more robust"? To be honest, I'm just looking at your output and asking myself: if I were them, how could I use that output to do what I think they want? And I'm just not seeing it helping. In the file I gave above, I averaged the output for each sensor across all the days for each given timeslot to help filter out the nuances. Then I both highlighted the remaining potential outliers using conditional formatting and counted the potential outliers for each panel, each based on 2 different thresholds (I used 1 SD and 2 SD, which is likely not optimal but as good as any without more data/analysis). So I see this as a tool to help indicate which panels have repeatedly low output throughout the day (e.g. the panel is going bad, obstructed, or dirty) and which panels tend to have low output at particular times of the day (e.g. the tracking mechanism isn't optimal, or the location is shaded by something).
I really don't see what your analysis is doing for them. Also, since it is column-based, I really don't understand how it is going to help, because the data has a natural trend from low to high and back to low. And specifically, in the comparison attached above, you have 2 "outliers" highlighted, and I look at the data and don't see what they are pointing out. That said, I would have been really curious to see what would happen if you had extended the data set a few more rows to include the 9:00 hour. I'm guessing it may highlight a row there, because the 9:15 data seems to drop for many of the panels (probably a large cloud). So PLEASE help me understand how to interpret your outputs so I can know when to recommend this useful tool.
- cool2021 · Jul 13, 2022 (Iron Contributor): @mtarler, I believe the core issue is the number of columns in the spreadsheet (8,245 columns to reflect the total number of panels), not the granularity of the data. Take a look at the attached report after running my Boxplot Analysis process on the first 10 records of both your daily-average summarized data (DailyAVGOut worksheet tab) and the raw data (RawOUT worksheet tab): you will find that the granular, raw data yields more robust results. The issue now is that the dataset must be organized better, so that only one or two rows of panels (fewer columns) are analyzed at a time, to make the analysis run quicker.
- Valentinaampuero · Jul 11, 2022 (Copper Contributor): @cool2021 @mtarler Thank you guys for the interest and the help. I now have better tools to solve this problem. But... I don't think I was clear enough. The panels have a mounting system that moves to follow the trajectory of the sun, to take better advantage of the radiation. However, this movement is programmed through the "tracker" system. Due to misconfiguration, certain panels shade each other, which occurs at sunrise and sunset. That said, I need to find those panels that shade each other (via the current data) so I can fix the tracking configuration. Panels that have lower (how much lower?) sunrise and sunset current values should be the misconfigured ones. However, we should not overlook the fact that passing clouds can also affect the current values.
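One hedged way to separate tracker misconfiguration from clouds, following the reasoning above: a cloud lowers all panels together (so a cross-panel median drops with it), while a shaded panel sits below its peers at the same sunrise/sunset slot day after day. A toy sketch, with invented currents and an assumed "half the median" cutoff:

```python
from statistics import median

# Toy sunrise-slot currents for 5 panels over 3 days. Panel 4 is shaded
# by a neighbour every morning; day 2 is cloudy, lowering everyone.
days = [
    [4.0, 4.1, 3.9, 4.0, 1.0],
    [2.0, 2.1, 1.9, 2.0, 0.5],
    [4.2, 4.0, 4.1, 3.9, 1.1],
]

# Flag a panel on a given day if it reads under half the cross-panel
# median (an assumed cutoff); a cloud moves the median down too, so it
# does not create flags by itself.
suspect_counts = [0] * 5
for readings in days:
    m = median(readings)
    for p, v in enumerate(readings):
        if v < 0.5 * m:
            suspect_counts[p] += 1

# Panels flagged every single day are tracker-misconfiguration suspects.
suspects = [p for p, c in enumerate(suspect_counts) if c == len(days)]
```

In this toy set only panel 4 is flagged on all three days, cloudy day included, which is exactly the "persistently low at sunrise/sunset, not just one bad day" signal being asked for. The 50%-of-median cutoff is a placeholder for the "how much lower?" question, which real data would have to answer.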
- mtarler · Jul 11, 2022 (Silver Contributor): That sounds pretty great. Have you looked at the data file I uploaded above? In that file, the tab labelled sheet0 is the raw data, and I created the DailyAvg tab. I think we might be saying much the same thing, and maybe applying your template to my added tab would be good. Maybe open that file and show me/us how your template works. Thanks for sharing; it sounds great.
- cool2021 · Jul 11, 2022 (Iron Contributor): If you apply my technique to the raw data without 0's, then it will be more robust and easier to operationalize from an IT and reporting perspective. Also, why would you not want to identify specific times of the day (not just overall in a day) when your solar capture is not optimal? I assume the objective here is to identify any within-day solar capture issues as well as just measuring daily. My template has a VBA-enabled algorithm that lets a user click one button; it then sifts through every column of the data and identifies within-day solar capture gaps and patterns of gaps across time (days).
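The column-by-column sweep described (drop the 0's, then flag within-day gaps per panel) might look like this outside VBA. This is a sketch with toy data, and the box-plot lower fence stands in for whatever rule the actual template uses:

```python
from statistics import quantiles

# Panels as columns (toy data): one day of 15-minute readings per panel,
# with zeros for darkness. The box-plot fence here is an assumption --
# the real template's VBA classification rule may differ.
panels = {
    "P1": [0, 0, 4.0, 4.1, 3.9, 4.0, 4.2, 4.1, 4.0, 0],
    "P2": [0, 0, 4.1, 4.0, 1.0, 4.0, 4.2, 4.1, 3.9, 0],  # one mid-day dip
}

gaps = {}
for name, col in panels.items():
    vals = [v for v in col if v > 0]          # drop the 0's first
    q1, _, q3 = quantiles(vals, n=4)          # per-panel quartiles
    fence = q1 - 1.5 * (q3 - q1)              # box-plot lower fence
    gaps[name] = sum(1 for v in vals if v < fence)
```

Sweeping every column this way gives a per-panel count of within-day capture gaps; in the toy data, P2's mid-day dip is the only reading flagged.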