SOLVED

Looking for trends in unrelated related data

Copper Contributor

I don't know if this is in the exact right spot, but here goes.

 

Let's say that I'm a biologist analyzing litters of puppies:

 

  • I've got six litters I want to analyze.
  • Each of those litters could have anywhere from five to ten puppies.
    • Each one of those puppies should be identical (for the sake of the thought experiment), but there's variation between each one.
  • Each of those puppies has roughly 30 different areas of the body that I want to analyze.
    • These 30 areas are the exact same spots on all of the puppies across the six litters, and, functionally speaking should be identical to each other of the chosen areas on the same puppy
  • For each one of those areas, I'm investigating aspects of up to each of the following features: skin, fur, and muscle
  • Each one of those features has a given set of characteristics
    • Fur: length, density, color, pattern
    • Skin: color, pattern, elasticity (okay, I know that sounds weird, but I'm grasping at straws for this analogy)
    • Muscle: strength, volume, stamina (lots and lots of straws being grasped)
  • Combining the above results in hundreds of trios of data that I'm trying to analyze: Area - Feature - Characteristic
  • Each trio has its own unique set of allowable values:
    • Area 2 fur length can be between 1 and 2 inches, whereas area 18 fur length can be any length under 4 inches
    • Area 7 skin color can be pink, white, or black, but area 20 skin color can only be brown
    • The Area 12 muscles need to be strong enough to lift 4 lbs, but there is no specification for the strength of the Area 2 muscles
  • Each unique trio of data points has the same allowable range as that same trio of data points across all puppies across all litters
    • Area 6 fur density from puppy 2, litter 1 has the same set of allowable values as Area 6 fur density for puppy 9, litter 10, as does area 12 fur density from puppy 2, litter 1.
  • Despite the fact that the range of allowable values for each trio of data is unique, each area is affected by the same things that affect all other areas; the same is true for Features and Characteristics:
    • Whether or not the puppies are indoor or outdoor dogs affects all fur length
    • The puppy's pedigree affects all puppies' skin color and elasticity
    • How much exercise any one puppy gets affects the strength of all of its muscles
  • Additionally, one trio of data may lend clues to another trio of data
    • Area 10 skin pattern often has an affect on area 10 fur pattern

Given all of that: I have thousands of data points that I'm trying to analyze and draw conclusions from, and I'm looking for the best way(s) to do so. Pivot Table seems like it'd be helpful, but, as I move the data around to better understand it, I can't make conditional formatting follow individual cells around that would highlight cells showing bad values based on that data point's area, feature, and characteristic. I'm also poking around with Power Query, or whatever it's called now, but I haven't been able to make anything useful.

 

Suggestions would be incredibly helpful; otherwise I just have to look at all of this data manually.

19 Replies

Hello @jamesson_kaupanger ,

 

Power Query will help if you need to clean the data or merge different data sources into one.

 

For analysis you may want to use Power Pivot. Load the data into the Power Pivot Data Model. Then you can create all kinds of measures for all the different qualities and properties. Total (sum), count and average are just a fraction of what Power Pivot can calculate. With the Power Pivot measures, you can then build pivot tables and charts to visualise the information.

 

Have a look for Power Pivot articles and tutorials. It was made for the stuff you describe.

 

@Ingeborg Hawighorst 

 

Thanks for responding.

 

I'm not looking to clean the data up; I trust the results themselves, and I'm trying to analyze them as they are.

 

I was thinking that Pivot Tables (which I'm assuming is related to Power Pivot, right?) might be the best way to go; it's just obnoxious that I can't seem to make the formatting sticky enough.

@jamesson_kaupanger  Pivot tables existed long before Power Pivot came along. Power Pivot has a new set of functions that can be used to analyze data and it goes way beyond what can be done with traditional pivot tables alone. You will still build pivot tables off the Power Pivot data model, but the measures that can be created with Power Pivot allow for a lot more and a lot differentiated analysis.

 

Load the data into the PowerPivot data model and create helper columns to help tag and classify data. Then these helper columns can be used in either measures or in filters and slicers of the pivot table you create from the data model.

 

Power Query can help you consolidate different data sources and shape data to the perfect form before it is loaded into the data model. Power Query on its own is not an analysis tool.

@Ingeborg Hawighorst 

 

Okay, then maybe you can help me set up my data; I've been trying to go through the tutorial, and while I think I understand better, I'm still not certain how best to set up my data to analyze it properly.

 

My data looks roughly like this (assuming that all of the associated data are numbers):

Table 1:

LitterPuppySiteFur colorFur patternSkin colorSkin pattern
1126736
114242345332
12276234340
12412452323

Table 2:

LitterPuppySiteFur colorFur patternSkin colorSkin pattern
2121723402
2145754512
22267489910
2242533

 

Again, this is somewhat simplified: I've got five litters, each of which has between 5 and 10 puppies, etc.

 

Any suggestions?

@Ingeborg Hawighorst 

Or perhaps this is a better way to talk about this:

 

 

LitterPuppySiteType of TestTestValue
5 Unique values, not repeatedBetween five and ten values, unique to one litter but Puppy 1 from litter 1 should be the "same" as Puppy 1 from litter 2~30 unique Sites, again, each unique to one Puppy, but each Puppy uses the same sites4 different test typesEach type of test either has one or two tests associated with it

The measured value from a given test of a given test type on a specific site of a particular puppy from a particular lot

 

Each value has an upper and a lower acceptable limit

 

Clear as mud?

@jamesson_kaupanger  It looks like the tables all have the same structure. In order to analyze across these tables, use Power Query to Append the queries into one. Load this into the Data Model.

 

I don't understand how puppy 2 can be the same for different litters. Biologically that is not possible, but if you use this as a metaphor only, sure, use the puppy number/code/name as the unique ID.

 

You can now build pivot tables and use slicers and/or filters to look at all qualities/properties across all tables. 

 

If you need help building these pivot tables, I would need to know what questions the pivot tables/charts should answer.  

 

I'll try that and let you know how it goes.

Yes, it's all metaphorical; for all intents and purposes, every single puppy across all the litters is supposed to be identical, and thus every test result for each characteristic from every test site from every puppy from every litter should be the same. That's what I'm trying to analyze: variations that may have resulted from processing defects, and how/if variations and nonconformances are related to each other, either across tests of the same type, or if specific sites are chronically problematic, or specific puppies are particularly prone to nonconformances.
Actually, technically speaking the data is more closely formatted to that first spreadsheet, but I'm thinking that it might be necessary to format it according to the second one. Unless I can extract that information after the fact.

@Ingeborg Hawighorst 

 

Basically, I don't know how to relate things:

 

I want all puppies of a given litter to be related.

I want all puppies of a specific number across the different litters related.

I want all sites from a given puppy related.

I want every instance of a particular site number related across all puppies of a litter.

I want every instance of a particular site number related across all puppies of all litters.

I want every instance of a particular site number related across all puppies of a specific number across all litters.

I want each test type related across all sites.

I want each test related across all sites. 

Etc.

 

I'm looking for correlation across at least 6 axes in the hopes of looking for trends.

 

@jamesson_kaupanger  I'm not sure what you mean by "I want x be related to y". 

 

Try to put into words what questions you need answered. For example:

 

- How many puppies of litter 1 have a fur color of 6 and a skin pattern of 23?

 

To answer that, you can build a pivot table that counts puppies, grouped by litter, fur color and skin pattern. You can use filters or slicers to show just a single combination of fur color and skin pattern

 

- What percentage of all puppies has Skin color 23 across all sites?

 

Again, build a pivot table, group by skin color and calculate the percentage of the total for all puppies.

 

Is that what you mean by related? You will need to build different pivot tables to answer different questions. 

 

 

@Ingeborg Hawighorst 

 

For instance: 

 

Let's say Litter 2 Puppy 4's Site 21 fur color is wrong.

 

  • Do Puppies from Litter 2 have a higher-than-average rate of wrong fur color?
  • Do all Puppy 4s have a higher-than-average rate of wrong fur color?
  • Do all Site 21s have a higher-than-average rate of wrong fur color?
  • If a Puppy's fur color is wrong, does that mean there's a greater-than-average likelihood that their fur length is wrong?
  • If a Puppy's fur color is wrong, does that mean there's a greater-than-average likelihood that their eye color is wrong?
best response confirmed by jamesson_kaupanger (Copper Contributor)
Solution

@jamesson_kaupanger 

 

What constitutes "wrong" fur color?

 

In your sample above, the Fur color is a numerical value. If you want to mark this as either right or wrong, then that is already one piece of analysis that you need to undertake in a separate step.

 

For example, you could create a column in the data that shows "fur color status", for which the value can be wrong or right or whatever. To arrive at the value, you may want to employ a formula that evaluates some of the other properties of that puppy. Maybe skin colors below 10 should have fur colors over 100 and if the fur color is not over 100, it is classified as wrong. Or something like that. An IF function should do the trick. With that function in place, you can then classify the data by the fur color status.

 

  • Do Puppies from Litter 2 have a higher-than-average rate of wrong fur color?

Create a pivot table that shows the average rate for each fur color status in the columns, the litter numbers in the rows.

  fur color status
Litter correct wrong
1 80% 20%
2 75% 25%
3 95% 5%
4 19% 89%
overall 67% 35%

 

  • Do all Puppy 4s have a higher-than-average rate of wrong fur color?

use a similar table, but with puppy numbers in the rows.

 

  • Do all Site 21s have a higher-than-average rate of wrong fur color?

use a similar table, but with site numbers in the rows.

 

  • If a Puppy's fur color is wrong, does that mean there's a greater-than-average likelihood that their fur length is wrong?
  • If a Puppy's fur color is wrong, does that mean there's a greater-than-average likelihood that their eye color is wrong?

The first three bullets were simple calculations. These last two are about probability and go deeper into the realm of statistical analysis. Again, you will first need to build the structures (helper columns) required to classify by fur length status and eye color status.

 

If you load the data into the Power Pivot data model, you can add the helper columns there. You can then also use the powerful statistical DAX functions  to create measures and then surface the results in pivot tables.

 

I'm not sure whether your question is more about understanding how Excel works, or what math to use to calculate a value with several variables, or if you need help with the whole concept of statistical analysis. 

 

Neither of these can be explained in a single forum question, because they each have their own learning curves.

 

 

 

@Ingeborg Hawighorst 

 

"Wrong" in the sense that it's an imperfect metaphor; each test has an associated range of acceptable values (6 to 10, -0.61 to -0.21, etc.), and if the tested value lies outside of that range, it's "wrong", or unacceptable, or out of spec.

 

I'd like to think that the main source of my difficulty, at this point, is in using Excel, given that I've only ever used it to represent and/or calculate data rather than derive conclusions from said data. I say, "...like to think..." because I'm not sure; it may be some combination of all three.

 

In any case, you've been fabulously helpful. I'll go tweak the data and see what I can come up with. Thank you!

@Ingeborg Hawighorst 

 

Okay, this is incredibly aggravating: do you have any idea why the Table Import Wizard is importing nearly 500 empty rows, even though those rows don't exist in the original table?

 

Edit: and 8 extra columns, mind you.

@jamesson_kaupanger 

They are empty rows/columns, but they are not clean ones. Most probably if you click Ctrl+End you'll be 5 rows down from your data and 8 columns to the right from it.

You may try to select these empty regions, on ribbon Home->Clear All. Save file, open again and check Ctrl+End if helped.

 

At the same time you may always filter nulls in Power Query.

@Sergei Baklan 

 

I could filter them but I couldn't create relationships between tables because there were "duplicate" values.

 

I've gotten rid of them.

 

Is there a way to mark on a Pivot Table lines representing minimum or maximum quantities?

@jamesson_kaupanger 

These are two different issues. First you clean the tables from empty rows/columns, the rest is next.

 

Many-to-many relationships are not supported directly in Excel data model, but that could be workarounds, depends on how your data is structured. Sorry, I didn't check entire this thread, too many information - only answered on concrete post. Perhaps @Ingeborg Hawighorst could give more concrete advice. 

@Sergei Baklan 

 

Correct; I wasn't trying to create a many-to-many relationship: there were blank rows in one of the tables which was causing Excel to think that there were duplicate values.

 

Is there a way to put a horizontal line (preferably two) on a Pivot Table at a specific point?

@jamesson_kaupanger 

You may try to play with conditional formatting, but I have no practical experience in applying it to Pivot Tables. Try to google for samples.

1 best response

Accepted Solutions
best response confirmed by jamesson_kaupanger (Copper Contributor)
Solution

@jamesson_kaupanger 

 

What constitutes "wrong" fur color?

 

In your sample above, the Fur color is a numerical value. If you want to mark this as either right or wrong, then that is already one piece of analysis that you need to undertake in a separate step.

 

For example, you could create a column in the data that shows "fur color status", for which the value can be wrong or right or whatever. To arrive at the value, you may want to employ a formula that evaluates some of the other properties of that puppy. Maybe skin colors below 10 should have fur colors over 100 and if the fur color is not over 100, it is classified as wrong. Or something like that. An IF function should do the trick. With that function in place, you can then classify the data by the fur color status.

 

  • Do Puppies from Litter 2 have a higher-than-average rate of wrong fur color?

Create a pivot table that shows the average rate for each fur color status in the columns, the litter numbers in the rows.

  fur color status
Litter correct wrong
1 80% 20%
2 75% 25%
3 95% 5%
4 19% 89%
overall 67% 35%

 

  • Do all Puppy 4s have a higher-than-average rate of wrong fur color?

use a similar table, but with puppy numbers in the rows.

 

  • Do all Site 21s have a higher-than-average rate of wrong fur color?

use a similar table, but with site numbers in the rows.

 

  • If a Puppy's fur color is wrong, does that mean there's a greater-than-average likelihood that their fur length is wrong?
  • If a Puppy's fur color is wrong, does that mean there's a greater-than-average likelihood that their eye color is wrong?

The first three bullets were simple calculations. These last two are about probability and go deeper into the realm of statistical analysis. Again, you will first need to build the structures (helper columns) required to classify by fur length status and eye color status.

 

If you load the data into the Power Pivot data model, you can add the helper columns there. You can then also use the powerful statistical DAX functions  to create measures and then surface the results in pivot tables.

 

I'm not sure whether your question is more about understanding how Excel works, or what math to use to calculate a value with several variables, or if you need help with the whole concept of statistical analysis. 

 

Neither of these can be explained in a single forum question, because they each have their own learning curves.

 

 

 

View solution in original post