Forum Discussion
finding deviations from the mean
- Jul 11, 2022 · @Valentinaampuero: So I got the sample files (May–Nov of 2021). I am attaching the Nov data here. In this file I added a tab (DailyAvg) where I calculate the average for every 15-minute slot for each panel across all the days recorded. I then calculate the average (col LED) and standard deviation (col LEE) for each 15-minute slot, and added conditional formatting to highlight all the spots more than 1 SD below the mean (orange) and more than 2 SD below (red). Finally, at the bottom I did a count of how many times each particular panel was found below 1 SD and below 2 SD on average. I hope this might be helpful and can serve as a template for the other months. Maybe even collate these summary counts across months for an even better indicator of performance across the year.
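For anyone following along outside Excel, the flag-and-count logic in that DailyAvg tab can be sketched in Python. The data below is an invented toy set; the slot/panel layout and dictionary names are assumptions, not the actual workbook:

```python
from statistics import mean, stdev

# Toy daily averages: each key is a 15-minute slot, each list holds the
# per-panel average across days (4 hypothetical panels; panel 3 runs low).
daily_avg = {
    "07:00": [5.0, 5.1, 4.9, 2.0],
    "07:15": [6.0, 6.2, 5.9, 2.5],
}

# Count, per panel, how often it falls more than 1 SD / 2 SD below the
# slot mean -- the same idea as the conditional-formatting counts.
flags = {p: {"below_1sd": 0, "below_2sd": 0} for p in range(4)}
for slot, values in daily_avg.items():
    m, s = mean(values), stdev(values)
    for panel, v in enumerate(values):
        if v < m - 2 * s:
            flags[panel]["below_2sd"] += 1
        elif v < m - s:
            flags[panel]["below_1sd"] += 1
```

With this toy data, panel 3 picks up a "below 1 SD" flag in both slots while the other panels stay clean, which is the per-panel summary count described above.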
@Valentinaampuero That is correct. And if I understand correctly that this data relates to solar power capture, then all of the 0's before 7:00 am in your dataset are also irrelevant: it simply makes sense that you will not capture solar energy when it is dark outside. Still, even without the impact of these 0's, yes, my median approach is more stable and does not rely on normality. Hope this helps.
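A minimal sketch of why the median holds up better than the mean here, assuming toy readings for one slot with pre-dawn zeros and one extreme value (all numbers invented):

```python
from statistics import mean, median

# Toy readings for one slot across days: pre-dawn zeros plus one
# extreme spike.
readings = [0, 0, 5.0, 5.1, 4.9, 50.0]
nonzero = [r for r in readings if r > 0]  # drop the dark-hours zeros

slot_mean = mean(nonzero)      # dragged far upward by the 50.0 outlier
slot_median = median(nonzero)  # stays near the typical panel output
```

Here the mean lands around 16.25 while the median stays at 5.05, right where most panels actually sit, with no normality assumption needed.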
That all said, I think this is getting stuck in the nuances: the mean and SD in this case were used as a sample indicator, not a statistical measure. By that I mean I am not using the SD to prove or disprove a hypothesis. I encourage @Valentinaampuero to use the template I created with any threshold they wish. I picked one, but they could use a 5th and 10th percentile rank, or 10% and 20% of max, or the median and something... Unless they have or generate additional data about what a true indicator of a bad panel looks like, these numbers and measures are just thumbs in the wind.
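For illustration, swapping in one of the alternative thresholds mentioned above (a percent-of-max cutoff instead of mean − k·SD) is a one-line change; the values here are invented:

```python
# Toy panel outputs for one slot; the thresholds are the percent-of-max
# cutoffs floated above, purely for illustration.
values = [5.0, 5.1, 4.9, 5.2, 0.9, 0.4]
mx = max(values)
below_10 = [v for v in values if v < 0.10 * mx]  # under 10% of max
below_20 = [v for v in values if v < 0.20 * mx]  # under 20% of max
```

Any of these cutoffs plugs into the same highlight-and-count template; without ground truth about what a bad panel looks like, the choice is a judgment call.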
I'm sure your template is great and I hope it helps them. I think the 'hurdle' I got them over was how to average the week's worth of data per time slot and per panel.
@Valentinaampuero, maybe you can 'ping' cool2021 with some questions or tips on statistical details that might help.
- cool2021 · Aug 02, 2022 (Iron Contributor): @mtarler, the attached workbook has another run of a sample of photovoltaic plant data through my box-plot outliers time-series algorithm. From the original dataset, I took a sample of 45 solar panels and ran the 15-minute-interval capture data through my algorithm, which identified a number of time-series segments. The final column on the 'Data' tab gives the number of panels (maximum of 45) during a specific 15-minute interval that had a 'Low' photovoltaic capture as classified by the algorithm. On the 'Pivot' tab, a pivot table summarizes, within each hour of the day, the % of panels with 'Low' capture. For example, between 8 and 9 am there were 4 readings (four 15-minute intervals), and you can see that one of those readings indicated that all 45 panels in this sample were catching 'Low' photovoltaic rays. Between 10 am and 4 pm, 100% of the panels captured the optimal level across all of the 15-minute intervals. But after 4 pm you start to see a higher percentage of readings where at least one of the 45 panels is 'Low'. And, incidentally, even if you transpose the raw data so the time intervals are columns and panels are rows, you will still find a similar result.
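The two steps described above can be sketched as follows, under the assumption that 'Low' means below the standard box-plot lower fence (Q1 − 1.5·IQR); the readings and counts are toy numbers, not the attached workbook's:

```python
from statistics import quantiles

# Step 1: classify 'Low' panels within one 15-minute interval, assuming
# 'Low' = below the box-plot lower fence (toy readings, 8 of 45 panels).
interval = [5.0, 5.2, 4.9, 5.1, 1.0, 5.3, 5.0, 4.8]
q1, _, q3 = quantiles(interval, n=4)      # quartiles (exclusive method)
low_fence = q1 - 1.5 * (q3 - q1)          # standard lower fence
low_count = sum(1 for v in interval if v < low_fence)

# Step 2: the hourly 'Pivot'-style rollup -- % of Low panel-readings in
# an hour, with four 15-minute intervals of 45 panels each (toy counts).
low_per_interval = {"08:00": 45, "08:15": 0, "08:30": 0, "08:45": 0}
pct_low = sum(low_per_interval.values()) / (4 * 45) * 100
```

In the toy interval only the 1.0 reading falls below the fence, and the hourly rollup reproduces the "one interval out of four was all-Low" shape described for the 8–9 am hour.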
- cool2021 · Jul 13, 2022 (Iron Contributor): @mtarler, one other point: Microsoft's colour-scaling criteria are not based on any test of statistical significance; my process is: https://support.microsoft.com/en-us/office/highlight-patterns-and-trends-with-conditional-formatting-eea152f5-2a7d-4c1a-a2da-c5f893adb621 So you may see patterns and trends using the Microsoft scales, but you will not identify statistically significant (important) events that are occurring. My process will, regardless of the data's scale or type ($, #, %), every single time. That is the key difference. When you see any shading using my process, it is statistically significant: always. Thanks
- cool2021 · Jul 13, 2022 (Iron Contributor): Can't one panel, or a row of solar panels, be adjusted to angle more to catch optimal light? Don't know. Repeated, statistically low solar capture is what analysts are looking for, along with any ways the panels can be adjusted to improve capture. One outlier, two outliers... etc. ... is not the point. Patterns of outliers (low or high) is, whether that pattern is across solar panels at a point in time or within a solar panel repeatedly during the day: peak periods when sunlight is strongest, for example. Patterns. Not points. THAT is the point. And that seems to be the question this poster asked: how can I find panels (rows or individual) that are not capturing optimal sunlight during periods of peak sunlight?
- mtarler · Jul 13, 2022 (Silver Contributor): @cool2021, your template tool is really great and can be really useful in particular situations; I just don't know if this is one of them. I don't know what you mean by "more robust results". If by that you mean it will show more outlier data points, then maybe, but I would contend that makes for "less robust results". I don't think they want to be distracted by needing to investigate every time a cloud hovered over 1 panel on 1 day. The whole point of averaging, filtering, and many other data techniques is to try to parse out the relevant and important information from the noise and the unimportant information. If your tool highlights all the noise together with the actual bad data, is that really "more robust"? To be honest, I'm just looking at your output and asking myself: if I were them, how could I use that output to do what I think they want? And I'm just not seeing it helping. In the file I gave above, I averaged the output for each sensor across all the days for each given timeslot to help filter out the nuances. Then I both highlighted the remaining potential outliers using conditional formatting and counted the potential outliers for each panel, each based on 2 different thresholds (I used 1 SD and 2 SD, which is likely not optimal but as good as any without more data/analysis). So I see this as a tool to help indicate which panels have repeatedly low output throughout the day (e.g. the panel is going bad, obstructed, or dirty) and which panels tend to have low output at particular times of the day (e.g. the tracking mechanism isn't optimal, or the location is shaded by something).
I really don't see what your analysis is doing for them. Also, since it is column-based, I really don't understand how it is going to help, because the data has a natural trend from low to high and back to low. And specifically, in the comparison attached above, you have 2 "outliers" highlighted, and I look at the data and don't see what they are pointing out. That said, I would have been really curious to see what would happen if you had extended the data set a few more rows to include the 9:00 hour. I'm guessing it may highlight a row there, because the 9:15 data seems to drop for many of the panels (probably a large cloud). So PLEASE help me understand how to interpret your outputs so I can know when to recommend this useful tool.
- cool2021 · Jul 13, 2022 (Iron Contributor): @mtarler, I believe the core issue is the number of columns in the spreadsheet (8,245 columns to reflect the total number of panels), not the granularity of the data. Take a look at the attached report after running my Boxplot Analysis process on the first 10 records of both your daily-average summarized data (DailyAVGOut worksheet tab) and the raw data (RawOUT worksheet tab): you will find that the granular, raw data yields more robust results. The issue now is that the dataset must be organized better, so that only one or two rows of panels (fewer columns) are analyzed at a time, to make the analysis run quicker.
- Valentinaampuero · Jul 11, 2022 (Copper Contributor): @cool2021 @mtarler Thank you guys for the interest and the help. I now have better tools to solve this problem. But... I don't think I was clear enough. The panels have a mounting system that moves to follow the trajectory of the sun, to take better advantage of the radiation. However, this movement is programmed through the "tracker" system. Due to misconfiguration, certain panels shade each other, which occurs at sunrise and sunset. That said, I need to find those panels that shade each other (via the current data) so I can fix the tracking configuration. Panels that have lower (how much lower?) sunrise and sunset current values should be the misconfigured ones. However, we should not overlook the fact that passing clouds can also affect the current values.
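One hedged way to separate tracker misconfiguration from clouds, following the reasoning above: a cloud lowers all panels together (so a cross-panel median drops with it), while a shaded panel sits below its peers at the same sunrise/sunset slot day after day. A toy sketch, with invented currents and an assumed "half the median" cutoff:

```python
from statistics import median

# Toy sunrise-slot currents for 5 panels over 3 days. Panel 4 is shaded
# by a neighbour every morning; day 2 is cloudy, lowering everyone.
days = [
    [4.0, 4.1, 3.9, 4.0, 1.0],
    [2.0, 2.1, 1.9, 2.0, 0.5],
    [4.2, 4.0, 4.1, 3.9, 1.1],
]

# Flag a panel on a given day if it reads under half the cross-panel
# median (an assumed cutoff); a cloud moves the median down too, so it
# does not create flags by itself.
suspect_counts = [0] * 5
for readings in days:
    m = median(readings)
    for p, v in enumerate(readings):
        if v < 0.5 * m:
            suspect_counts[p] += 1

# Panels flagged every single day are tracker-misconfiguration suspects.
suspects = [p for p, c in enumerate(suspect_counts) if c == len(days)]
```

In this toy set only panel 4 is flagged on all three days, cloudy day included, which is exactly the "persistently low at sunrise/sunset, not just one bad day" signal being asked for. The 50%-of-median cutoff is a placeholder for the "how much lower?" question, which real data would have to answer.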
- mtarler · Jul 11, 2022 (Silver Contributor): That sounds pretty great. Have you looked at the data file I uploaded above? In that file, the tab labelled sheet0 is the raw data, and I created the DailyAvg tab. I think we might be saying much the same thing, and maybe applying your template to my added tab would be good. Maybe open that file and show me/us how your template works. Thanks for sharing; it sounds great.
- cool2021 · Jul 11, 2022 (Iron Contributor): If you apply my technique to the raw data without 0's, then it will be more robust and easier to operationalize from an IT and reporting perspective. Also, why would you not want to identify specific times of the day (not just overall in a day) when your solar capture is not optimal? I assume the objective here is to identify any within-day solar capture issues as well as just measuring daily. My template has a VBA-enabled algorithm that lets a user click one button; it then sifts through every column of the data and identifies within-day solar capture gaps and patterns of gaps across time (days).
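The column-by-column sweep described (drop the 0's, then flag within-day gaps per panel) might look like this outside VBA. This is a sketch with toy data, and the box-plot lower fence stands in for whatever rule the actual template uses:

```python
from statistics import quantiles

# Panels as columns (toy data): one day of 15-minute readings per panel,
# with zeros for darkness. The box-plot fence here is an assumption --
# the real template's VBA classification rule may differ.
panels = {
    "P1": [0, 0, 4.0, 4.1, 3.9, 4.0, 4.2, 4.1, 4.0, 0],
    "P2": [0, 0, 4.1, 4.0, 1.0, 4.0, 4.2, 4.1, 3.9, 0],  # one mid-day dip
}

gaps = {}
for name, col in panels.items():
    vals = [v for v in col if v > 0]          # drop the 0's first
    q1, _, q3 = quantiles(vals, n=4)          # per-panel quartiles
    fence = q1 - 1.5 * (q3 - q1)              # box-plot lower fence
    gaps[name] = sum(1 for v in vals if v < fence)
```

Sweeping every column this way gives a per-panel count of within-day capture gaps; in the toy data, P2's mid-day dip is the only reading flagged.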