Association Rule mining using excel; Count how many rows have two searched for numbers

DoctorofMadness

New Member
Joined
Mar 13, 2023
Messages
6
Office Version
  1. 365
Platform
  1. Windows
I'm a college student working on a report currently and am having an issue with a problem I've recently run into and it's really throwing me for a loop. I thought I found a solution in this post, but couldn't get it to output anything but 0.
[[ Count how many times in total two values appear together in a row for all the rows ]]

To begin the project I was given a dataset of a little under 18000 transactions, with 19 different items that may have been purchased. The goal of the project is to draw what item types have associations with each other; for instance, if someone buys product 17, what is the confidence we can have that they would also purchase product 4 in the same transaction. The way I have the sheet looks as follows (first 20 transactions)

Transaction_IDCustomer_IDItem_TypesItem Types
1​
15107144​
17,8,5,4,17854
2​
15107169​
17,9,8,1798
3​
15120097​
13,8,138
4​
15128454​
9,9,8,998
5​
15128488​
10,10,7,10107
6​
15131912​
2,2
7​
15134734​
3,2,2,322
8​
15500173​
13,10,8,7,5,2,2,1,1310875221
9​
15502484​
5,5
10​
15507087​
7,7
11​
15508887​
8,8
12​
15510149​
12,12
13​
15513135​
12,12
14​
15514612​
13,13
15​
15518225​
13,10,1,13101
16​
15518985​
8,8,3,883
17​
15520494​
13,5,135
18​
15523811​
17,17
19​
15524504​
14,11,10,5,5,2,1,1411105521
20​
15529982​
12,12

On the right side of the spreadsheet, the data is just pulled out of the Item_Types column using Textsplit() to remove the commas.

I thought I was onto something with the thread I linked above but cannot get it to output anything except 0, but I also have two theories onto why it doesn't work; each transaction might see the same item precedent or antecedent ( first item or second item ) more than once (see Transaction_ID 19, #5 appears twice).

I also decided to write the formula with 100 columns in the mmult() part, since I thought the large amount of blank space wouldn't affect anything since the way I am interpreting it to work.

=SUM((MMULT(--('Transaction IDs + Item Types'!D2:CY17918=L3),{1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1})>0)
*(MMULT(--('Transaction IDs + Item Types'!D2:CY17918=O3),{1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1})>0))


Let me know if you have any ideas. I am trying to figure a way to write a formula that will check if a row contains two values I am looking for anywhere in the row, while allowing the same number(s) to potentially appear more than once. I also know there are different ways this could be done, one of which using powerpivot, others using RStudio, I just haven't used those tools before and would rather try and do it on software I understand before resorting to learning new tools.
 

Excel Facts

Whats the difference between CONCAT and CONCATENATE?
The newer CONCAT function can reference a range of cells. =CONCATENATE(A1,A2,A3,A4,A5) becomes =CONCAT(A1:A5)
Hi DoctorofMadness,

With the data you have shown above in range A1:K:21 and two numbers to search for in M1 and N1.

You could get unique values from the item types by using the following formula:
Excel Formula:
=UNIQUE(TEXTSPLIT(C2,",",,TRUE,0,),TRUE,FALSE)
I do not think that is required to get to the solution, but it is an option if you want.

For a True/False result of whether or not each row matches the numbers in M1 and N1 you could use the following formula:
Excel Formula:
=AND(ISNUMBER(MATCH($M$1,VALUE(D2#),0)),ISNUMBER(MATCH($N$1,VALUE(D2#),0)))
I have the formula in cell M2, copied down.

I hope that helps,

Doug
 
Upvote 0
Hi & welcome to MrExcel.
How about
Fluff.xlsm
ABCDEFGHIJKLMN
1Transaction_IDCustomer_IDItem_TypesItem Types8173
211510714417,8,5,4,17854
321510716917,9,8,1798
431512009713,8,138
54151284549,9,8,998
651512848810,10,7,10107
76151319122,2
87151347343,2,2,322
981550017313,10,8,7,5,2,2,1,1310875221
109155024845,5
1110155070877,7
1211155088878,8
13121551014912,12
14131551313512,12
15141551461213,13
16151551822513,10,1,13101
1716155189858,8,3,883
18171552049413,5,135
19181552381117,17
20191552450414,11,10,5,5,2,1,1411105521
21201552998212,12
22
View
Cell Formulas
RangeFormula
N1N1=SUM(--(BYROW(D2:K21,LAMBDA(br,SUM(COUNTIFS(br,L1:M1))=2))))
 
Upvote 1
Solution
Hi & welcome to MrExcel.
How about
Fluff.xlsm
ABCDEFGHIJKLMN
1Transaction_IDCustomer_IDItem_TypesItem Types8173
211510714417,8,5,4,17854
321510716917,9,8,1798
431512009713,8,138
54151284549,9,8,998
651512848810,10,7,10107
76151319122,2
87151347343,2,2,322
981550017313,10,8,7,5,2,2,1,1310875221
109155024845,5
1110155070877,7
1211155088878,8
13121551014912,12
14131551313512,12
15141551461213,13
16151551822513,10,1,13101
1716155189858,8,3,883
18171552049413,5,135
19181552381117,17
20191552450414,11,10,5,5,2,1,1411105521
21201552998212,12
22
View
Cell Formulas
RangeFormula
N1N1=SUM(--(BYROW(D2:K21,LAMBDA(br,SUM(COUNTIFS(br,L1:M1))=2))))

Worked like a charm! Just had to rearrange the sheet I had my searched for values in to accommodate for it. Thank you!
For reference, this equation was what needed to go into "# of Customers buying both".
Support is just F / (D+E), the percentile of how many times this occurs when someone buys product 1 or product 2. Support is the main thing I am looking for when doing this association rule stuff.


Confidence
Antecedent1Consequence1# of Customers buying Antecedent# of Customers buying Consequence# of Customers buying bothSupportAssociation, Antecedent = ConsequenceAssociation, Consequence = Antecedent
1712-->6379505913940.1218744540.218529550.275548527
175-->6379485415830.1409240630.2481580180.326122785
178-->6379483816570.1477222070.2597585830.3424969
173-->6379446315640.1442538280.2451794950.350436926
1721-->637943189520.0889969150.1492396930.220472441
1720-->637937299520.0941828250.1492396930.255296326
172-->6379347913810.1400892680.2164916130.396953147
1713-->6379330414610.1508829910.2290327640.442191283
179-->6379307012820.1356757330.2009719390.417589577
177-->6379210312120.1428908280.1899984320.576319544
171-->6379147110940.1393630570.1715002350.743711761
1716-->637984410550.1460611930.1653864241.25
1714-->63795029910.1440197650.1553535041.974103586
 
Upvote 0
Glad we could help & thanks for the feedback.
 
Upvote 0
I'm not sure how this got past me, I must have just been excited to see something work but something is wrong, though not sure where. If you look at the bottom of the pasted table, we see for pair [17,16], while 6379 people buy item 17, 844 people buy item 16.

But somehow it says 1055 people have a transaction with both 17 and 16 in it, when the maximum should be 844.

What does the =2 mean at the end of this equation? I'm having trouble following the equation, but when playing around changing this number, I noticed that making it smaller made the result increase, while making it bigger made the result decrease. I also don't think my equation to find the # of customers is wrong since it is just a simple countif()

Let me know what you think, I haven't used the lambda function for anything before, and by extension the byrow() function since it seems to only be applied to lambda functions
 
Upvote 0
Could you have a customer who has bought 17 twice?
 
Upvote 0
You're absolutely right I'm counting using a faulty method, once I fix that I don't see why it wouldn't work

149​
15517441​
17,14,9,3,20,21,8,17,2,9,12,17149320218172912
In transaction 149 someone does buy item 17 twice, along with a multitude of other things

From the RAW data I was counting from it looks like this:

Transaction_IDCustomer_IDItem_TypeItem_NumberVendor_IDDateUnits_BoughtCoupon_OriginCoupon_Value_(Cents)
14915517441175412005100
149155174411750069199535200
149155174411440453005100
1491551744112300772365100
14915517441111177404005100
149155174411087026111115100
149155174419100813635200
14915517441916071301005100
149155174418720440005100
14915517441326430480015100

Since I was counting the raw data by column C just saying "if you see number X, count it", it would mark multiple purchases of the item even though it was all within one transaction. This is the same raw data I used to formulate the sheet first posted on this thread. I'll try and come up with a way to count without repetitions

Thank you
 
Upvote 0
How about like
Fluff.xlsm
ABCDEFGHIJK
1Transaction_IDCustomer_IDItem_TypesItem Types
211510714417,8,5,17,4,17854
321510716917,9,8,17,1798
431512009713,8,138
54151284549,9,8,98
651512848810,10,7,107
76151319122,2
87151347343,2,2,32
981550017313,10,8,7,5,2,2,1,131087521
109155024845,5
1110155070877,7
1211155088878,8
13121551014912,12
14131551313512,12
15141551461213,13
16151551822513,10,1,13101
1716155189858,8,3,83
18171552049413,5,135
19181552381117,17
20191552450414,11,10,5,5,2,1,141110521
21201552998212,12
22
View
Cell Formulas
RangeFormula
D2:G2,D20:I20,D17:E18,D16:F16,D9:J9,D8:E8,D7,D10:D15,D19,D21,D4:E6,D3:F3D2=UNIQUE(TEXTSPLIT(C2,",",,1),1)+0
Dynamic array formulas.
 
Upvote 0
The way I suggested will return TRUE if both numbers are found in one row, regardless if there are multiples of the same number. If you want a count, you could do COUNTIF TRUE. There are likely a bunch of ways to do this.

Doug
 
Upvote 0

Forum statistics

Threads
1,223,956
Messages
6,175,619
Members
452,661
Latest member
Nonhle

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top