SOLVED

finding deviations from the mean


Hi everyone

 

I am working with large amounts of current data from a photovoltaic plant, and I need to find the current values that deviate from the average, i.e. the panels that are producing less than the rest. I only know that there is less production at sunrise and sunset, because at those times the PV panels are shaded and the current they produce decreases.

I have data every 15 minutes from 2021 to 2022.

 

I would appreciate it if anyone could help me with this.

 

Thanks in advance 

 

Valentina

19 Replies
This is a really cool project, but we really need more info. Can you share a sample of the sheet? If you can't attach it here, you can share a link in SharePoint or Dropbox or the like, or if you PM it to me here (click my icon and click Message), I can attach it for you.
That said, about half of the values will be below the mean (more or less by definition, though not exactly). So you need to define a better metric: perhaps values more than 1 or 2 standard deviations below the mean, or functions like PERCENTILE, QUARTILE, or RANK to help. You also need to decide whether you want average values and how to average. As you noted, output rises and falls throughout the day, and some panels may be located better than others. Should you look at the average for a day across the whole set, or compare a panel's monthly average with the SAME panel a year earlier and see how much it dropped relative to how much all the others dropped, since any given month may be more or less sunny?
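A minimal pandas sketch of the PERCENTILE/RANK idea (the file name, column layout, and 5% cutoff are all assumptions for illustration, not from the thread):

```python
import pandas as pd

# Hypothetical layout: rows are 15-minute timestamps, columns are panel IDs,
# values are measured current.
df = pd.read_csv("pv_current.csv", index_col="timestamp", parse_dates=True)

# For each timestamp (row), rank every panel against its peers:
# a rank of 0.05 means the panel is in the bottom 5% at that moment.
ranks = df.rank(axis=1, pct=True)

# Flag readings in the bottom 5% of their time slot.
flagged = ranks < 0.05

# Count how often each panel lands in the bottom 5%.
low_counts = flagged.sum().sort_values(ascending=False)
print(low_counts.head(10))
```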

@mtarler thank you for your answer.

 

I agree with you that there will be multiple below-average current values, which does not mean that all those PV panels are shaded. The panels whose values are considerably below average (thank you very much for the suggestion to incorporate the standard deviation) would be the shaded ones.

 

I am analyzing the values every 15 minutes, i.e.:
7:00
7:15
7:30
7:45
8:00
etc.

In this way I will be able to build a daily, then monthly, and finally annual trend, understanding that the position of the sun and the angle of incidence of solar radiation vary seasonally, so the modules that receive shade in summer may not be the same as in winter.
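As a sketch of how the 15-minute readings could be rolled up into daily, monthly, and seasonal views, assuming a timestamp-indexed table with one column per panel (file name and layout are hypothetical):

```python
import pandas as pd

df = pd.read_csv("pv_current.csv", index_col="timestamp", parse_dates=True)

# Daily and monthly mean current per panel, built from the 15-minute readings.
daily = df.resample("D").mean()
monthly = df.resample("M").mean()   # use "ME" on pandas >= 2.2

# Seasonal view: average profile per 15-minute slot, grouped by month,
# so summer and winter shading patterns can be compared.
profile = df.groupby([df.index.month, df.index.time]).mean()
```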

 

Any chance you can attach the document or give a link to it in SharePoint or Dropbox or something?
Thanks very much for your answer. I'm going to check out the IQR method, which I haven't heard of before.

I don't think the values are strictly "outliers", because there is not a huge difference between the average current and the minimum values (which are supposed to come from the panels that are shaded and therefore produce less current than the rest). The difference is between 10% and 30% below the average.
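For reference, a minimal sketch of the IQR (boxplot) rule applied per time slot, under the same assumed layout as above; the 1.5 multiplier is the textbook default and may need tightening given a dip of only 10-30%:

```python
import pandas as pd

df = pd.read_csv("pv_current.csv", index_col="timestamp", parse_dates=True)

# Classic boxplot rule, applied per time slot (row): a reading is an
# outlier if it falls below Q1 - 1.5 * IQR for that timestamp.
q1 = df.quantile(0.25, axis=1)
q3 = df.quantile(0.75, axis=1)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr

# Compare each panel's reading with its row's lower fence.
low = df.lt(lower_fence, axis=0)

# With only a 10-30% dip, few readings may cross the fence;
# a smaller multiplier (e.g. 1.0) makes the rule more sensitive.
print(low.sum().sort_values(ascending=False).head())
```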

Thank you very much, have a great day!

Best

best response confirmed by Hans Vogelaar (MVP)
Solution

@Valentinaampuero So I got the sample files (May-Nov of 2021). I am attaching the Nov data here. In this file I added a tab (DailyAvg) where I calculate the average for every 15-minute slot for each panel across all the days recorded. I then calculate the mean (col LED) and STD (col LEE) for each 15-minute slot and added conditional formatting to highlight all the spots more than 1 STD below the mean (orange) and more than 2 STD below (red). Finally, at the bottom I did a count of how many times each particular panel was found to be more than 1 STD and more than 2 STD below on average. I hope this is helpful and can be a template for the other months. Maybe even collate these summary counts across months for an even better indicator of performance across the year.
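For anyone who wants to reproduce this outside Excel, here is a rough pandas equivalent of the DailyAvg tab's logic (the file name and layout are assumptions; LED/LEE refer to the workbook's columns):

```python
import pandas as pd

# Assumed layout, mirroring the DailyAvg tab: one row per 15-minute slot
# of the day, one column per panel, values = current averaged across days.
daily_avg = pd.read_csv("daily_avg.csv", index_col="slot")

row_mean = daily_avg.mean(axis=1)   # like col LED in the workbook
row_std = daily_avg.std(axis=1)     # like col LEE in the workbook

# Flags analogous to the orange/red conditional formatting.
below_1sd = daily_avg.lt(row_mean - row_std, axis=0)
below_2sd = daily_avg.lt(row_mean - 2 * row_std, axis=0)

# Per-panel counts, like the totals at the bottom of the sheet.
summary = pd.DataFrame({
    "slots_below_1sd": below_1sd.sum(),
    "slots_below_2sd": below_2sd.sum(),
}).sort_values("slots_below_2sd", ascending=False)
print(summary.head(10))
```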

@mtarler Wow! You did in one hour what I couldn't do in the whole weekend, hahaha.

I'm looking at the file right now and trying to understand the formulas and the results.

 

Thank you so much! 

 

 

@Valentinaampuero please see my reply below to @mtarler

Hello, thanks for your answer. Now, if I have understood correctly, the standard deviation is not a valid measure because the data is not normally distributed.

Maybe the way you propose is more accurate and it should be used.

Thanks

@Valentinaampuero That is correct. And, if I understand correctly that this data is related to solar power capture, then all of the 0's before 7:00 am in your dataset are also irrelevant, since you simply will not capture solar energy when it is dark outside. Even so, without the impact of these 0's, yes, my median approach is more stable and does not rely on normality. Hope this helps.
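A small sketch of this median-based idea, treating zeros as darkness rather than low production, and using the median absolute deviation (MAD) as the robust spread measure; the 3-MAD cutoff and file layout are assumptions:

```python
import pandas as pd

df = pd.read_csv("pv_current.csv", index_col="timestamp", parse_dates=True)

# Drop night-time readings: treat zeros as "dark", not as low producers.
day = df.where(df > 0)

# Robust per-timestamp statistics: median and median absolute deviation
# (MAD) are not pulled around by skew the way mean/std are.
med = day.median(axis=1)
mad = day.sub(med, axis=0).abs().median(axis=1)

# Flag panels more than 3 MADs below the fleet median at that moment.
low = day.lt(med - 3 * mad, axis=0)
print(low.sum().sort_values(ascending=False).head())
```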

Hi cool2021, I'm no statistician, and I agree that the use of standard deviation (at least for it to be meaningful in its definition) requires the data to have a normal distribution. I also recall that if you don't know whether the data is normally distributed, showing that the mean, median, and mode are all the same is an acceptable indicator (but not a guarantee). I believe the median has similar issues and concerns, and it depends on what you are trying to get. The extreme example is a bimodal data set: two spikes of data, where the mean gives a point in between (potentially impossible or unstable) while the median gives a point on one spike or the other, giving little credence to the second state. Basically, you must know your data, understand your data, and know what you need.
That all said, I think this is getting stuck in the nuances, as the mean and STD were used here as a sample indicator, not a statistical measure; that is, I am not using the STD to prove or disprove a hypothesis. I encourage @Valentinaampuero to use the template I created with any threshold they wish. I picked one, but they could use the 5th and 10th percentile rank, or 10% and 20% of max, or the median and something else... Unless they have or can generate additional data about what truly indicates a bad panel, these numbers and measures are just thumbs in the wind.
I'm sure your template is great and I hope it helps them. I think the 'hurdle' I got them over was how to average the week's worth of data per time slot and per panel.
@Valentinaampuero maybe you can 'ping' @cool2021 with some questions or tips on statistical details that might help.
If you apply my technique to the raw data without 0s, it will be more robust and easier to operationalize from an IT and reporting perspective. Also, why would you not want to identify the specific times of day (not just overall in a day) when your solar capture is not optimal? I assume the objective here is to identify within-day solar capture issues as well as daily ones. My template has a VBA-enabled algorithm that lets a user click one button to sift through every column of the data and identify within-day solar capture gaps and patterns of gaps across time (days).
That sounds pretty great. Have you looked at the data file I uploaded above? In that file the tab labelled sheet0 is the raw data, and I created the dailyAvg tab. I think we might be talking about much the same thing, and maybe applying your template to my added tab would be good. Maybe open that file and show me/us how your template works. Thanks for sharing; it sounds great.

@cool2021 @mtarler Thank you guys for the interest and the help. I now have better tools to solve this problem. 

But...

 

I don't think I was clear enough.

 

The panels have a mounting system that follows the trajectory of the sun to take better advantage of the radiation. This movement is programmed through the "tracker" system. Due to misconfiguration, certain panels shade each other, which occurs at sunrise and sunset.

That said, I need to find the panels that shade each other (via the current data) so I can fix the tracking configuration.

 

Panels that have lower (how much lower?) sunrise and sunset current values should be the misconfigured ones. However, we should not overlook the fact that clouds can pass over and also affect the current values.
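One hedged way to separate tracker shading from clouds: clouds dim (nearly) all panels at once, so normalizing each reading by the fleet-wide median at the same timestamp cancels most cloud effects, leaving panel-specific deficits. A sketch, with the sunrise/sunset windows and the 15% threshold as illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("pv_current.csv", index_col="timestamp", parse_dates=True)
day = df.where(df > 0)  # ignore night-time zeros

# Dividing each reading by the fleet median at the same timestamp
# largely cancels out cloud effects, which hit all panels together.
rel = day.div(day.median(axis=1), axis=0)

# Keep only the shoulder hours where tracker shading shows up.
hours = day.index.hour
edges = rel[(hours < 9) | (hours >= 17)]   # assumed sunrise/sunset windows

# Panels whose sunrise/sunset output is consistently, say, >15% below
# the fleet median are the tracker-misconfiguration suspects.
deficit = 1 - edges.median()
print(deficit.sort_values(ascending=False).head(10))
```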

 

 

 

 

@mtarler I believe the core issue is the number of columns in the spreadsheet (8,245 columns, reflecting the total number of panels), not the granularity of the data. Take a look at the attached report after running my Boxplot Analysis process on the first 10 records of both your Daily Average summarized data (DailyAVGOut worksheet tab) and the raw data (RawOUT worksheet tab): you will find that the granular, raw data yields more robust results. The issue now is that the dataset must be organized better, so that one or two rows of panels (fewer columns) are analyzed at a time to make the analysis run quicker.

@cool2021, your template tool is really great and can be really useful in particular situations; I just don't know if this is one of them. I don't know what you mean by "more robust results". If you mean it will show more outlier data points, then maybe, but I would contend that would be "less robust results". I don't think they want to be distracted by investigating every time a cloud hovered over one panel for one day. The whole point of averaging, filtering, and many other data techniques is to parse the relevant and important information out of the noise. If your tool highlights all the noise together with the actual bad data, is that really "more robust"?
To be honest, I'm looking at your output and asking myself: if I were them, how could I use that output to do what I think they want? And I'm just not seeing it helping. In the file I gave above, I averaged the output for each sensor across all the days for each given time slot to help filter out the nuances. Then I both highlighted the remaining potential outliers using conditional formatting and did a count of potential outliers for each panel, each based on 2 different thresholds (I used 1 SD and 2 SD, which is likely not optimal but as good as any without more data/analysis). So I see this as a tool to help indicate which panels have repeatedly low output throughout the day (e.g. the panel is going bad, obstructed, or dirty) and which panels tend to have low output at particular times during the day (e.g. the tracking mechanism isn't optimal, or the location is shaded by something).
I really don't see what your analysis is doing for them. Also, since it is column-based, I really don't understand how it is going to help, because the data has a natural trend from low to high and back to low. Specifically, in the attached comparison above you have 2 "outliers" highlighted, and when I look at the data I don't see what it is pointing out. That said, I would have been really curious to see what would happen if you had extended the data set a few more rows to include the 9:00 hour. I'm guessing it may highlight a row there, because the 9:15 data seems to show a drop for many of the panels (probably a large cloud). So PLEASE help me understand how to interpret your outputs so I can know when to recommend this useful tool.
Can't a panel, or a row of panels, be adjusted to a better angle to catch optimal light? I don't know. Repeated, statistically low solar capture is what analysts are looking for, along with any ways panels can be adjusted to improve capture. One outlier, two outliers, etc. is not the point. Patterns of outliers (low or high) are, whether that pattern is across solar panels at a point in time or within a single panel repeatedly during the day, for example during peak periods when sunlight is strongest. Patterns, not points, are the point. And that seems to be the question this poster asked: how can I find panels (rows or individual) that are not capturing optimal sunlight during periods of peak sunlight?

@mtarler one other point: Microsoft's colour scaling criteria are not based on any test of statistical significance; my process is: https://support.microsoft.com/en-us/office/highlight-patterns-and-trends-with-conditional-formatting...

 

So, you may see patterns and trends using the Microsoft scales, but you will not identify the statistically significant (important) events that are occurring. My process will, regardless of the data's scale and type ($, #, %), every single time. That is the key difference. When you see any shading using my process, it is statistically significant: always.

 

Thanks

@mtarler, the attached workbook has another run of a sample of photovoltaic plant data through my box plot outliers time series algorithm. From the original dataset, I took a sample of 45 solar panels and ran the 15-minute interval photovoltaic capture data through my algorithm, which identified a number of time series segments. The final column on the 'Data' tab gives the number of panels (maximum of 45) that had a 'Low' photovoltaic capture during a specific 15-minute interval, as classified by the algorithm. On the 'Pivot' tab of the workbook, you will see a pivot table summarizing, within each hour of the day, the % of photovoltaic panels with 'Low' capture. For example, between 8 and 9 am there were 4 readings of data (four 15-minute intervals), and you can see that one of those readings indicated that all 45 panels in this sample were catching 'Low' photovoltaic rays. Between 10 am and 4 pm, 100% of solar panels were capturing the optimal level across all of the 15-minute intervals. But after 4 pm, you start to see a higher percentage of readings where at least one of the 45 measured panels is capturing 'Low' photovoltaic rays. Incidentally, even if you were to transpose the raw data so the time intervals are columns and the panels are rows, you would still find a similar result.
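A sketch of the same hour-by-hour roll-up in pandas, assuming a boolean 'Low' matrix produced by any of the earlier sketches (the function name is hypothetical):

```python
import pandas as pd

def hourly_low_summary(low: pd.DataFrame) -> pd.Series:
    """Summarize a boolean 'Low' matrix (rows = 15-minute timestamps,
    columns = panels, True = 'Low' capture) by hour of day."""
    # Per 15-minute interval: % of panels flagged 'Low'.
    pct_low = low.mean(axis=1) * 100
    # Pivot-style summary: average % of 'Low' panels within each hour.
    return pct_low.groupby(pct_low.index.hour).mean().round(1)

# Expect higher values in the early-morning and evening hours,
# matching the sunrise/sunset shading pattern described in the thread.
```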

@Valentinaampuero I actually found a way to efficiently run a report on all 8,245 solar panels' data at once. It really works well.
