Association Rule mining using excel; Count how many rows have two searched for numbers

DoctorofMadness · Mar 13, 2023

I'm a college student working on a report, and I've run into a problem that is really throwing me for a loop. I thought I had found a solution in this post, but I couldn't get it to output anything but 0.
[[ Count how many times in total two values appear together in a row for all the rows ]]

To begin the project I was given a dataset of a little under 18,000 transactions, with 19 different items that may have been purchased. The goal of the project is to work out which item types are associated with each other; for instance, if someone buys product 17, how confident can we be that they also purchased product 4 in the same transaction? The sheet I have looks as follows (first 20 transactions):

Transaction_ID	Customer_ID	Item_Types	Item Types
1	15107144	17,8,5,4,	17	8	5	4
2	15107169	17,9,8,	17	9	8
3	15120097	13,8,	13	8
4	15128454	9,9,8,	9	9	8
5	15128488	10,10,7,	10	10	7
6	15131912	2,	2
7	15134734	3,2,2,	3	2	2
8	15500173	13,10,8,7,5,2,2,1,	13	10	8	7	5	2	2	1
9	15502484	5,	5
10	15507087	7,	7
11	15508887	8,	8
12	15510149	12,	12
13	15513135	12,	12
14	15514612	13,	13
15	15518225	13,10,1,	13	10	1
16	15518985	8,8,3,	8	8	3
17	15520494	13,5,	13	5
18	15523811	17,	17
19	15524504	14,11,10,5,5,2,1,	14	11	10	5	5	2	1
20	15529982	12,	12

On the right side of the spreadsheet, the data is just pulled out of the Item_Types column using TEXTSPLIT() to remove the commas.

I thought I was onto something with the thread I linked above, but I cannot get it to output anything except 0. I have a couple of theories about why it doesn't work. One is that a transaction might contain the same item, whether antecedent or consequent (first item or second item), more than once (see Transaction_ID 19, where item 5 appears twice).

I also wrote the formula with 100 columns in the MMULT() part, since, as I understand how it works, the large amount of blank space shouldn't affect anything.

=SUM((MMULT(--('Transaction IDs + Item Types'!D2:CY17918=L3),{1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1})>0)
*(MMULT(--('Transaction IDs + Item Types'!D2:CY17918=O3),{1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1})>0))

Let me know if you have any ideas. I am trying to write a formula that checks whether a row contains both values I am looking for, anywhere in the row, while allowing the same number(s) to appear more than once. I know there are other ways this could be done, such as Power Pivot or RStudio, but I haven't used those tools before and would rather do it in software I understand before resorting to learning new tools.
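
For what it's worth, the presence-times-presence logic in the MMULT formula above can be checked outside Excel. Below is a minimal Python sketch of the same idea; the function name `count_rows_with_both` and the padded sample rows are made up for illustration. It also shows one possible cause of a constant 0. This is an assumption, not something confirmed from the workbook: TEXTSPLIT returns text, and a cell holding the text "17" never equals the number 17.

```python
def count_rows_with_both(rows, first, second):
    """Count rows that contain both values at least once.

    Mirrors (MMULT(--(range=first), ones) > 0) * (MMULT(--(range=second), ones) > 0):
    a row contributes 1 only when both presence flags are true.
    """
    total = 0
    for row in rows:
        has_first = any(cell == first for cell in row)    # row sum of matches > 0
        has_second = any(cell == second for cell in row)
        total += int(has_first and has_second)
    return total


# First four transactions from the post, padded with blanks (None) like unused columns.
numeric_rows = [
    [17, 8, 5, 4, None],
    [17, 9, 8, None, None],
    [13, 8, None, None, None],
    [9, 9, 8, None, None],
]
# The same rows stored as text, which is what a text-splitting step would produce.
text_rows = [[None if c is None else str(c) for c in row] for row in numeric_rows]

print(count_rows_with_both(numeric_rows, 17, 8))  # 2: numbers compared with numbers
print(count_rows_with_both(text_rows, 17, 8))     # 0: text compared with numbers
```

Repeated items and blank padding do not change the result here, because each row is reduced to a yes/no flag per searched value before the two flags are multiplied.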

duggie33 · Mar 13, 2023

Hi DoctorofMadness,

Assume the data you have shown above is in range A1:K21 and the two numbers to search for are in M1 and N1.

You could get unique values from the item types by using the following formula:

Excel Formula:

=UNIQUE(TEXTSPLIT(C2,",",,TRUE,0,),TRUE,FALSE)

I do not think that is required to get to the solution, but it is an option if you want.

For a True/False result of whether or not each row matches the numbers in M1 and N1 you could use the following formula:

Excel Formula:

=AND(ISNUMBER(MATCH($M$1,VALUE(D2#),0)),ISNUMBER(MATCH($N$1,VALUE(D2#),0)))

I have the formula in cell M2, copied down.

I hope that helps,

Doug

Fluff · Mar 14, 2023

Hi & welcome to MrExcel.
How about

Fluff.xlsm
	A	B	C	D	E	F	G	H	I	J	K	L	M	N
1	Transaction_ID	Customer_ID	Item_Types	Item Types								8	17	3
2	1	15107144	17,8,5,4,	17	8	5	4
3	2	15107169	17,9,8,	17	9	8
4	3	15120097	13,8,	13	8
5	4	15128454	9,9,8,	9	9	8
6	5	15128488	10,10,7,	10	10	7
7	6	15131912	2,	2
8	7	15134734	3,2,2,	3	2	2
9	8	15500173	13,10,8,7,5,2,2,1,	13	10	8	7	5	2	2	1
10	9	15502484	5,	5
11	10	15507087	7,	7
12	11	15508887	8,	8
13	12	15510149	12,	12
14	13	15513135	12,	12
15	14	15514612	13,	13
16	15	15518225	13,10,1,	13	10	1
17	16	15518985	8,8,3,	8	8	3
18	17	15520494	13,5,	13	5
19	18	15523811	17,	17
20	19	15524504	14,11,10,5,5,2,1,	14	11	10	5	5	2	1
21	20	15529982	12,	12
22

Cell Formulas
Range	Formula
N1	N1	=SUM(--(BYROW(D2:K21,LAMBDA(br,SUM(COUNTIFS(br,L1:M1))=2))))

DoctorofMadness · Mar 14, 2023

Fluff said:
Hi & welcome to MrExcel.
How about

N1 N1 =SUM(--(BYROW(D2:K21,LAMBDA(br,SUM(COUNTIFS(br,L1:M1))=2))))

Worked like a charm! I just had to rearrange the sheet holding my searched-for values to accommodate it. Thank you!
For reference, this formula is what needed to go into "# of Customers buying both".
Support is just F / (D+E), the proportion of times this occurs when someone buys product 1 or product 2. Support is the main thing I am looking for when doing this association rule stuff.

Antecedent1	Consequence1		# of Customers buying Antecedent	# of Customers buying Consequence	# of Customers buying both	Support	Confidence, Antecedent --> Consequence	Confidence, Consequence --> Antecedent
17	12	-->	6379	5059	1394	0.121874454	0.21852955	0.275548527
17	5	-->	6379	4854	1583	0.140924063	0.248158018	0.326122785
17	8	-->	6379	4838	1657	0.147722207	0.259758583	0.3424969
17	3	-->	6379	4463	1564	0.144253828	0.245179495	0.350436926
17	21	-->	6379	4318	952	0.088996915	0.149239693	0.220472441
17	20	-->	6379	3729	952	0.094182825	0.149239693	0.255296326
17	2	-->	6379	3479	1381	0.140089268	0.216491613	0.396953147
17	13	-->	6379	3304	1461	0.150882991	0.229032764	0.442191283
17	9	-->	6379	3070	1282	0.135675733	0.200971939	0.417589577
17	7	-->	6379	2103	1212	0.142890828	0.189998432	0.576319544
17	1	-->	6379	1471	1094	0.139363057	0.171500235	0.743711761
17	16	-->	6379	844	1055	0.146061193	0.165386424	1.25
17	14	-->	6379	502	991	0.144019765	0.155353504	1.974103586
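
The 17 --> 12 row of the table above can be checked by hand. Here is a small Python sketch under the sheet's own definitions (support as F / (D+E), confidence as the pair count over each single-item count); the function name `rule_metrics` is made up for illustration. Textbook support is usually the pair count divided by the total number of transactions, but the sketch keeps the sheet's formula rather than swapping it.

```python
def rule_metrics(n_antecedent, n_consequent, n_both):
    """Return (support, conf_a_to_c, conf_c_to_a) using the sheet's definitions."""
    support = n_both / (n_antecedent + n_consequent)  # F / (D + E), as in the post
    conf_a_to_c = n_both / n_antecedent               # antecedent buyers who also buy the consequent
    conf_c_to_a = n_both / n_consequent               # consequent buyers who also buy the antecedent
    return support, conf_a_to_c, conf_c_to_a


# Counts from the 17 --> 12 row: 6379 buy item 17, 5059 buy item 12, 1394 buy both.
support, conf_ac, conf_ca = rule_metrics(6379, 5059, 1394)
print(round(support, 9), round(conf_ac, 9), round(conf_ca, 9))
```

Since the pair count can never exceed either single-item count, each confidence value should land between 0 and 1 when all three counts are taken with the same rule.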

Fluff · Mar 14, 2023

Glad we could help & thanks for the feedback.

DoctorofMadness · Mar 14, 2023

I'm not sure how this got past me; I must have just been excited to see something work, but something is wrong, though I'm not sure where. At the bottom of the pasted table, for the pair [17,16], 6379 people buy item 17 and 844 people buy item 16.

But somehow it says 1055 people have a transaction with both 17 and 16 in it, when the maximum should be 844.

What does the =2 at the end of this formula mean? I'm having trouble following the formula, but when I played around with this number, I noticed that making it smaller increased the result, while making it bigger decreased it. I also don't think my formula for the # of customers is wrong, since it is just a simple COUNTIF().

Let me know what you think. I haven't used the LAMBDA function for anything before, nor BYROW(), since it seems to be used only with LAMBDA functions.

Fluff · Mar 14, 2023

Could you have a customer who has bought 17 twice?
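
On the `=2` question above: COUNTIFS(br,L1:M1) returns, for one row, how many times each of the two searched values occurs; SUM adds those two counts, and `=2` keeps the row only when the total is exactly two. A rough Python sketch of that test next to a plain presence test follows. The function names are invented for illustration, and the row `[17, 17, 8]` is hypothetical rather than taken from the data.

```python
def sum_of_counts_equals_two(row, first, second):
    """Mimic SUM(COUNTIFS(row, {first, second})) = 2 for a single row."""
    return row.count(first) + row.count(second) == 2


def contains_both(row, first, second):
    """Presence test that ignores how many times each value repeats."""
    return first in row and second in row


rows = [
    [17, 8, 5, 4],  # transaction 1: 8 and 17 once each
    [8, 8, 3],      # transaction 16: 8 twice, no 17
    [17, 17, 8],    # hypothetical: 17 twice plus 8
]
for row in rows:
    print(row, sum_of_counts_equals_two(row, 8, 17), contains_both(row, 8, 17))
```

The two tests agree only on rows without repeats. A row with one value twice and the other missing passes the `=2` test, while a row with both values and a repeat fails it, so the total can drift in either direction once duplicates are present.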

DoctorofMadness · Mar 14, 2023

You're absolutely right, I'm counting using a faulty method. Once I fix that, I don't see why it wouldn't work.

149	15517441	17,14,9,3,20,21,8,17,2,9,12,	17	14	9	3	20	21	8	17	2	9	12

In transaction 149 someone does buy item 17 twice, along with a multitude of other things.

In the raw data I was counting from, it looks like this:

Transaction_ID	Customer_ID	Item_Type	Item_Number	Vendor_ID	Date	Units_Bought	Coupon_Origin	Coupon_Value_(Cents)

149	15517441	17	5	41200	5	1
149	15517441	17	50069	19953	5	2
149	15517441	14	40	45300	5	1
149	15517441	12	300	77236	5	1
149	15517441	11	1177	40400	5	1
149	15517441	10	87026	11111	5	1
149	15517441	9	100	81363	5	2
149	15517441	9	16071	30100	5	1
149	15517441	8	720	44000	5	1
149	15517441	3	26430	48001	5	1

Since I was counting the raw data by column C, just saying "if you see number X, count it", it counted multiple purchases of the same item even though they were all within one transaction. This is the same raw data I used to build the sheet first posted in this thread. I'll try to come up with a way to count without repetitions.
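
One way to picture counting without repetitions is to count distinct transactions per item type instead of raw rows. The Python sketch below is only an illustration of that idea, not the method used in the workbook; `transactions_per_item` is a made-up name, and the input is just the (Transaction_ID, Item_Type) pairs from the raw rows shown for transaction 149.

```python
from collections import defaultdict


def transactions_per_item(raw_rows):
    """Map each item type to the number of distinct transactions containing it."""
    seen = defaultdict(set)
    for transaction_id, item_type in raw_rows:
        seen[item_type].add(transaction_id)  # a set ignores repeat purchases
    return {item: len(ids) for item, ids in seen.items()}


# (Transaction_ID, Item_Type) pairs from the raw rows for transaction 149.
raw_rows = [(149, 17), (149, 17), (149, 14), (149, 12), (149, 11),
            (149, 10), (149, 9), (149, 9), (149, 8), (149, 3)]

counts = transactions_per_item(raw_rows)
naive_17 = sum(1 for _, item in raw_rows if item == 17)  # row-by-row count
print(naive_17, counts[17])  # 2 raw rows, but only 1 transaction
```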

Thank you

Fluff · Mar 14, 2023

How about like

Fluff.xlsm
	A	B	C	D	E	F	G	H	I	J	K
1	Transaction_ID	Customer_ID	Item_Types	Item Types
2	1	15107144	17,8,5,17,4,	17	8	5	4
3	2	15107169	17,9,8,17,	17	9	8
4	3	15120097	13,8,	13	8
5	4	15128454	9,9,8,	9	8
6	5	15128488	10,10,7,	10	7
7	6	15131912	2,	2
8	7	15134734	3,2,2,	3	2
9	8	15500173	13,10,8,7,5,2,2,1,	13	10	8	7	5	2	1
10	9	15502484	5,	5
11	10	15507087	7,	7
12	11	15508887	8,	8
13	12	15510149	12,	12
14	13	15513135	12,	12
15	14	15514612	13,	13
16	15	15518225	13,10,1,	13	10	1
17	16	15518985	8,8,3,	8	3
18	17	15520494	13,5,	13	5
19	18	15523811	17,	17
20	19	15524504	14,11,10,5,5,2,1,	14	11	10	5	2	1
21	20	15529982	12,	12
22

Cell Formulas
Range	Formula
D2:G2,D20:I20,D17:E18,D16:F16,D9:J9,D8:E8,D7,D10:D15,D19,D21,D4:E6,D3:F3	D2	=UNIQUE(TEXTSPLIT(C2,",",,1),1)+0
Dynamic array formulas.
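
For anyone following along outside Excel, the split, de-duplicate, and convert-to-number steps of the formula above can be sketched in Python roughly as follows. The helper name `unique_item_types` is made up for illustration, and the sketch assumes every non-empty piece is a whole number.

```python
def unique_item_types(item_types):
    """Split a comma-separated list, drop empties, de-duplicate, return ints."""
    parts = [p for p in item_types.split(",") if p.strip()]  # ignore the trailing comma
    return list(dict.fromkeys(int(p) for p in parts))         # keeps first-seen order


print(unique_item_types("17,8,5,17,4,"))  # [17, 8, 5, 4]
```

Converting to numbers matters for the same reason as the +0 in the formula: the pieces come out of the split as text, and text does not match numeric search values.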

duggie33 · Mar 14, 2023

The way I suggested will return TRUE if both numbers are found in one row, regardless of whether the same number appears multiple times. If you want a count, you could use COUNTIF to count the TRUE results. There are likely a bunch of ways to do this.

Doug
