Forum Discussion
Your thoughts serializing arrays to manage state in excel (by way of inventory tracking)
I'm curious what you all think of this approach for managing state in excel with minimal computational impact. I'm sure after a few thousand different items to track, performance may be implicated, but the formulas initially feel like they will create static values that don't have to be recalculated with any frequency. Skip towards the end if you just want to see the formulas. (P.S. I updated the code to make it simpler and a bit more robust against invalid data)
I think one of the important ideas here is that this formula can be used within a table column the same as any other formula that doesn't return an array, the difference being that calculations are not lost between steps of the row and the only information a calculation needs is the limited state info in the prior state.
Any comments welcome. If you've found a better way to do this, I'd be curious to know. I am limiting myself to using only native excel formulas/functionality plus excel labs.
P.S. The formulas have one or two undeveloped ideas because they are not essential. For instance, there is one inline lambda that uses a switch to return a comparison function for data masking that has only implemented the equals and does not equal operators.
**********Description********************
General idea is such - write a series of formulas that serialize and deserialize arrays and combine it with an efficient mechanism to access the most recent state. The example uses the idea of lots of items where each lot has a quantity and cost - transactions can either increase the quantity (and cost) or decrease it. The state being managed is the existent lots, their quantities, and their costs.
columns are simple:
control; date; lot; qty; cost; state
So here is the overall structure:
table formula checks control column to see if the current row impacts the state -
if it doesn't, return nothing,
if it does, move on
if it is the first control number, just serialize the row data
if it isn't, find the prior state and update it
The way it updates the state is by
1) deserializing the prior state
2) determining whether the prior state already included the current row's item
3) updating that value if so or just returning the current rows values,
4) filtering the prior state to remove the current row's item
5) vstacking the filtered array onto the current row's entry (potentially updated), and
6) serializing the current state.
Using 40 rows as an example, I might end with a serialized state that looks like this:
lot_H,65,650;lot_I,100,1000;lot_J,45,450;lot_K,70,700;lot_L,30,300;lot_A,135,1350;lot_B,150,1500;lot_C,145,1450;lot_D,90,900;lot_E,120,1200;lot_F,80,800;lot_G,170,1700
and deserializes to this:
lot_A 135 1350
lot_B 150 1500
lot_C 145 1450
lot_D 90 900
lot_E 120 1200
lot_F 80 800
lot_G 170 1700
lot_H 65 650
lot_I 100 1000
lot_J 45 450
lot_K 70 700
lot_L 30 300
using this function: SORT(array.deserialize_string(I57))
table formula:
=LET(
current_trans_array, HSTACK([@lot], [@qty], [@cost]),
current_trans_date_no, [@date],
prior_state_date_no, MAX(FILTER([date], [date] < current_trans_date_no, 1)),
is_first_trans, (current_trans_date_no = prior_state_date_no),
prior_state_array, IF(
is_first_trans,
current_trans_array,
array.deserialize_string(XLOOKUP(prior_state_date_no, [date], [state], "Not found", 0))
),
updated_array, IF(
is_first_trans,
current_trans_array,
array.track_state(prior_state_array, current_trans_array)
),
result, array.serialize_array(updated_array),
result
)
module formulas:
deserialize_string = LAMBDA(state_string, TEXTSPLIT(state_string, ",", ";", FALSE));
serialize_row = LAMBDA(row, TEXTJOIN(",", FALSE, row));
serialize_array = LAMBDA(state_array,
REDUCE(
,
BYROW(state_array, LAMBDA(r, serialize_row(r))),
LAMBDA(acc, current_row, acc & ";" & current_row)
)
);
track_state =
lambda(
prior_state_array, current_array,
LET(
zero_if_empty, lambda(x, if(isnumber(value(x)),x,0)),
current_id, INDEX(current_array, 1, 1),
current_id_not_found_flag, hstack(current_id, 0, 0),
is_only_item,
AND(
ROWS(prior_state_array)=1,
INDEX(prior_state_array,1,1)=current_id
),
filtered_array,
FILTER(
prior_state_array,
CHOOSECOLS(prior_state_array,1)<>current_id,
current_id_not_found_flag
),
is_new_item,
IF(ROWS(prior_state_array)=ROWS(filtered_array),AND(filtered_array = prior_state_array),FALSE),
prior_array,
if(
is_new_item,
current_id_not_found_flag,
FILTER(
prior_state_array,
choosecols(prior_state_array,1)=current_id,
current_id_not_found_flag
)
),
updated_current_array,
HSTACK(
current_id,
(zero_if_empty(INDEX(prior_array, 1, 2)) + zero_if_empty(INDEX(current_array, 1, 2))),
zero_if_empty(INDEX(prior_array, 1, 3))+zero_if_empty(INDEX(current_array, 1, 3))
),
final_array,
if(
is_only_item,
updated_current_array,
vstack(filtered_array,updated_current_array)
),
sort(filter(final_array, CHOOSECOLS(final_array,2)<>0))
)
);
- lori_mSteel Contributor
This is an interesting discussion and I believe points to a limitation in Excel. What I would like to see is formulas being able to return more general 'Data Types' to cells including lambda definitions or array results so that data could be held in memory rather than output to the sheet. I've seen a few options over recent years that may be of relevance to this kind of scenario:
- The Python in Excel extension is taking steps in this direction by supporting dataframes within cells: https://support.microsoft.com/en-gb/office/python-in-excel-dataframes-a10495b2-8372-4f0f-9179-32771fe0dc04
- Add-in libraries that support 'object handles' (e.g. XLL+/XLW frameworks) which are of particular relevance with Real-Time Data where array sizes may be variable and for efficiency only dependent cells need be recalculated.
- Without any extensions, I think the suggested text serialisation approach is a decent workaround for smaller data sets e.g. using ARRAYTOTEXT / TEXTJOIN functions. Another possibility might be to store transactional data as records (#rows, #cols, {data}) via array shaping functions TOROW / WRAPROWS - which might overcome the 32k data restriction though tables could not then be utilised due to the need for data spilling. - PeterBartholomew1Silver Contributor
I really need to see the formula in the context of its workbook. You are doing a number of things that I wouldn't do but that does not mean they are wrong; they are worth exploring. I think you are calculating running totals for a number of products. The calculations I have experience of are simpler in that only one product stream is considered at a time but more complex in that it matches outputs to inputs on a FIFO basis.
I normally move outside the table to perform such calculation, using arrays and spill ranges rather than using the inherent replication of table formula. There seems to be some benefit of your approach in that persistent storage is used to hold the prior step of an cumulative calculation. The downside would appear to be that the serialised data representing the evolving state would seem to be cumbersome, so if appears there are swings and roundabouts to be evaluated.
I would suggest you use your serialised data to hold the current state but to record the minimum of accumulated data as possible. Since I would be using an array calculation, I would have access to every step returned by the calculation but that means that the entire calculation is processed each time data is added. I am sorry that my ability to visualise and evaluate your proposal has proved to be somewhat limited.
- joelb95Brass Contributor
Hi Peter,
Thanks for replying. I've attached a primitive example for you - at this point I haven't actually implemented it in a meaningful data set (though I have demonstrated it works across thousands of lines of data).
The motivation is overcoming the limitations of excel's requirement for every cell to return a value, i.e. you can't return a thunk or any other fancy container/wrapper to a cell and then reference that cell for its value. This would, theoretically, allow you to persist the intermediate calculations in involved formulas without having to solve the array of array problems (or using the name manager to name intermediate calculation results).
Obviously some refinement is needed (such as using better row/column delimiters) and if taken seriously, data compression techniques could be automatically implied (removing all but the most essential data and encoding that data to the extent reasonable). If done correctly, with about 30k characters available in a string, you can save quite a lot of information in one of excel's most efficient data structures (strings are saved as entire discrete values and given a single lookup key).
A bigger application of the idea, for instance, would be assigning individual lot numbers to every inflow of a security and being able to update that lots quantity and remaining cost basis according to whatever formula the cell using the array wants, i.e. this time specific identification, that time FIFO, another time LIFO, another time cost averaging, etc. Because the entire lot state is being persisted, a single outflow transaction of securities can relatively simply impact all affected security lots - whether that be one or a thousand. For instance, if you are using an average method over 15 lots of IBM shares, one formula could just search for lots of IBM in the prior state and adjust their values to their new remaining shares/value on a pro rata basis. Because the state was persisted, the entire updating of lots is basically a simple filter of one table that consists exclusively of the updated information for each remaining lot.
Even if this type of persistence were limited to something like once every thousand rows with the inbetween steps requiring full calculation/accumulation of all steps subsequent to the last persisted state, surely figuring out the state after 1,000 transactions is better than figuring it out after 10,000.
- PeterBartholomew1Silver Contributor
Your serialization appears to be well implemented and works. The methods we are developing are far removed from one another and the manner in which they scale is likely to be different. I tend to use thunks and perform entire calculations as one. I haven't given much thought to the way in which I could separate out 'historical' data to leave a smaller dynamic formula to bring the calculation up to date. Perhaps I should!
The idea of serialising the data appears to be considerably more complicated than assigning a field to stats from each lot but it should be more dynamic when new lots are introduced.
I have included the type of calculations I would turn to if faced with a problem such as that you describe, but it is so different that I doubt it will be of much value to you.