Create term base from large data set

chillzen

New Member
Joined
Feb 1, 2022
Messages
8
Office Version
  1. 2016
Platform
  1. Windows
  2. MacOS
  3. Mobile
  4. Web
Hello,

I am stuck trying to solve what may be a very simple problem. I understand logically what needs to be done I just don't have the knowledge for the best way to handle this much data and not make excel crash. I am currently limited with no access to utilize Pandas to solve this even though that seems to be the best option. If this can be done in Excel possibly using VBA that would be great. I am creating a 2-column term base for two languages (A and B).

Notes:
Columns A:N = Lang A
Columns O:AP = Lang B
Each row is a set of a word in LangA and its synonyms with LangB and its grammatical case variations
~26K rows total
Total of 67,431 LangA terms | 285,974 Lang B
Not all Language A terms have Language B equivalents and vise versa but still need to be returned (to fill in later with the translations)
Average of 2.5 LangA terms in each row and avg 10.5 LangB
Output needs to be 2 columns (may be split into multiple sheets/files if needed since it will obviously be an extremely large number of rows generated)

Below is the best sample I can provide that should help work towards a solution:

Raw Data
A TermsA TermsB TermsB Terms
A1A2B1B2
A3A4B3
A5B4B5
A6B6
A7
B7B8
B9

Desired Output
Term ATerm B
A1B1
A1B2
A2B1
A2B2
A3B3
A4B3
A5B4
A5B5
A6B6
A7
B7
B8
B9
 

Excel Facts

Repeat Last Command
Pressing F4 adds dollar signs when editing a formula. When not editing, F4 repeats last command.
Welcome to the MrExcel forum!

Give this macro a try:

VBA Code:
Sub Flatten()
Dim SrcSheet As Worksheet, TrgSheet As Worksheet, OutData() As String, MaxRows As Long
Dim MyData As Variant, OutCol As Long, ctr As Long, r As Long, c1 As Long, c2 As Long

    Set SrcSheet = Sheets("Sheet16")
    Set TrgSheet = Sheets("Sheet17")
        
    MaxRows = Rows.Count
    c1 = SrcSheet.Cells(Rows.Count, "A").End(xlUp).Row
    c2 = SrcSheet.Cells(Rows.Count, "O").End(xlUp).Row
    If c1 < c2 Then c1 = c2
    MyData = SrcSheet.Range("A1:AP" & c1).Value
    
    OutCol = 1
    ReDim OutData(1 To MaxRows, 1 To 2)
    OutData(1, 1) = "Term A"
    OutData(1, 2) = "Term B"
    ctr = 2
    
    For r = 2 To UBound(MyData)
    
        c1 = 1
        Do
            c2 = 15
            Do
                OutData(ctr, 1) = MyData(r, c1)
                OutData(ctr, 2) = MyData(r, c2)
                ctr = ctr + 1
                If ctr = MaxRows + 1 Then
                    TrgSheet.Cells(1, OutCol).Resize(MaxRows, 2).Value = OutData
                    OutCol = OutCol + 3
                    ReDim OutData(1 To MaxRows, 1 To 2)
                    OutData(1, 1) = "Term A"
                    OutData(1, 2) = "Term B"
                    ctr = 2
                End If
                c2 = c2 + 1
            Loop Until MyData(r, c2) = ""
            c1 = c1 + 1
        Loop Until MyData(r, c1) = ""
        
    Next r
    If ctr > 2 Then TrgSheet.Cells(1, OutCol).Resize(ctr - 1, 2).Value = OutData
    
End Sub

Put the name of the input sheet and the output sheet into the first two lines.
 
Upvote 0
Welcome to the MrExcel forum!

Give this macro a try:

VBA Code:
Sub Flatten()
Dim SrcSheet As Worksheet, TrgSheet As Worksheet, OutData() As String, MaxRows As Long
Dim MyData As Variant, OutCol As Long, ctr As Long, r As Long, c1 As Long, c2 As Long

    Set SrcSheet = Sheets("Sheet16")
    Set TrgSheet = Sheets("Sheet17")
       
    MaxRows = Rows.Count
    c1 = SrcSheet.Cells(Rows.Count, "A").End(xlUp).Row
    c2 = SrcSheet.Cells(Rows.Count, "O").End(xlUp).Row
    If c1 < c2 Then c1 = c2
    MyData = SrcSheet.Range("A1:AP" & c1).Value
   
    OutCol = 1
    ReDim OutData(1 To MaxRows, 1 To 2)
    OutData(1, 1) = "Term A"
    OutData(1, 2) = "Term B"
    ctr = 2
   
    For r = 2 To UBound(MyData)
   
        c1 = 1
        Do
            c2 = 15
            Do
                OutData(ctr, 1) = MyData(r, c1)
                OutData(ctr, 2) = MyData(r, c2)
                ctr = ctr + 1
                If ctr = MaxRows + 1 Then
                    TrgSheet.Cells(1, OutCol).Resize(MaxRows, 2).Value = OutData
                    OutCol = OutCol + 3
                    ReDim OutData(1 To MaxRows, 1 To 2)
                    OutData(1, 1) = "Term A"
                    OutData(1, 2) = "Term B"
                    ctr = 2
                End If
                c2 = c2 + 1
            Loop Until MyData(r, c2) = ""
            c1 = c1 + 1
        Loop Until MyData(r, c1) = ""
       
    Next r
    If ctr > 2 Then TrgSheet.Cells(1, OutCol).Resize(ctr - 1, 2).Value = OutData
   
End Sub

Put the name of the input sheet and the output sheet into the first two lines.
Eric,

This seems to be on the right track! It managed to give me all of the pairs for the LangA (A:O) with the LangB pair in column P, but it is not returning the pairs for A:O to Q:AP.
 
Upvote 0
This is what I started with:

Book1 (version 1).xlsb
ABCDEFGHIJKLMNOP
1A TermsA TermsB TermsB Terms
2A1A2B1B2
3A3A4B3
4A5B4B5
5A6B6
6A7
7B7B8
8B9
Sheet16


And what I ended up with:

Book1 (version 1).xlsb
AB
1Term ATerm B
2A1B1
3A1B2
4A2B1
5A2B2
6A3B3
7A4B3
8A5B4
9A5B5
10A6B6
11A7
12B7
13B8
14B9
Sheet17


Are these the proper columns? Do you have any gaps between words on the same row? If this looks ok, can you show me a few lines that do not work right? (The XL2BB tool is most helpful in showing the exact rows and columns of your data. Check out the button in the reply box. It's easy to download, install, and use.)
 
Upvote 0
This is what I started with:

Book1 (version 1).xlsb
ABCDEFGHIJKLMNOP
1A TermsA TermsB TermsB Terms
2A1A2B1B2
3A3A4B3
4A5B4B5
5A6B6
6A7
7B7B8
8B9
Sheet16


And what I ended up with:

Book1 (version 1).xlsb
AB
1Term ATerm B
2A1B1
3A1B2
4A2B1
5A2B2
6A3B3
7A4B3
8A5B4
9A5B5
10A6B6
11A7
12B7
13B8
14B9
Sheet17


Are these the proper columns? Do you have any gaps between words on the same row? If this looks ok, can you show me a few lines that do not work right? (The XL2BB tool is most helpful in showing the exact rows and columns of your data. Check out the button in the reply box. It's easy to download, install, and use.)
Maybe I could have been more clear in my example, apologies! Here is a screenshot of what the top of the file looks like. (I hid E:N because they are much like A:B) For A:O there should be no gaps between cells in the same row, and between P:AP no gaps either. I can try the XL2BB tool if the screenshot isn't sufficient.
 

Attachments

  • Capture.PNG
    Capture.PNG
    72.8 KB · Views: 15
Upvote 0
I don't understand. Column P looks entirely empty, as does T and V. Column R has a smattering of words. That's what I meant by "gaps". On row 2, there is a gap in columns P, R, T, and V. That's how my macro knows how to quit looking at a row, when it encounters an empty cell. Do you want me to look through every column in A:N, and every column in O:AP to see if there's something in it?
 
Upvote 0
I don't understand. Column P looks entirely empty, as does T and V. Column R has a smattering of words. That's what I meant by "gaps". On row 2, there is a gap in columns P, R, T, and V. That's how my macro knows how to quit looking at a row, when it encounters an empty cell. Do you want me to look through every column in A:N, and every column in O:AP to see if there's something in it?
Every row could be considered a set and every row will have at least one word in it. A1:N1 is our range for set 1 and would all be English synonyms of each other and we can say our identical keys. Then we have O1:AP1 range which would be the Russian translation and each column within that range is the Russian word for each part of speech/case (singular nominative | singular genitive | a second variation of the singular genitive | singular dative | singular dative second variation | etc for each part of speech) Not every Russian word has a 2nd variation of spelling for the genitive case for example, so that call would be black for that set.

There are also some sets of Russian words that the database doesn't have any English translations or synonyms so for that set, columns A:N would be blank.
 
Upvote 0
I don't understand. Column P looks entirely empty, as does T and V. Column R has a smattering of words. That's what I meant by "gaps". On row 2, there is a gap in columns P, R, T, and V. That's how my macro knows how to quit looking at a row, when it encounters an empty cell. Do you want me to look through every column in A:N, and every column in O:AP to see if there's something in it?
I am working on a better example of my input/output model for this for explanation purposes and will post as soon as done.
 
Upvote 0
Input Example
Book1
ABCDEFGH
1EnglishEnglishEnglishEnglishRussianRussianRussianRussian
2personpeoplemanR1R2R3
3yearR4R5R6R7
4eye
5R8
6R9
7friendpalR10
Sheet1


Desired output
Book1
AB
18EnglishRussian
19personR1
20personR2
21personR3
22peopleR1
23peopleR2
24peopleR3
25manR1
26manR2
27manR3
28yearR4
29yearR5
30yearR6
31yearR7
32eye
33R8
34R9
35friendR10
36friendR10
Sheet1


A:D in the example represents A:P - English range
E:H in the example represents O:AP - Russian range and it's spelling variations depending on the part of speech.
Not all words have a variation for a different part of speech so it would have a blank in that spot

IF there is a value in the English range, there will not be a blank gap between English cells, but may have blanks in the right cells.
IF there are values in the Russian range, there may be a blank gap in any of the cells.
 
Upvote 0
[EDIT] Correcting range. A:N = English. O:AP = Russian

Input Example
Book1
ABCDEFGH
1EnglishEnglishEnglishEnglishRussianRussianRussianRussian
2personpeoplemanR1R2R3
3yearR4R5R6R7
4eye
5R8
6R9
7friendpalR10
Sheet1


Desired output
Book1
AB
18EnglishRussian
19personR1
20personR2
21personR3
22peopleR1
23peopleR2
24peopleR3
25manR1
26manR2
27manR3
28yearR4
29yearR5
30yearR6
31yearR7
32eye
33R8
34R9
35friendR10
36friendR10
Sheet1


A:D in the example represents A:N - English range
E:H in the example represents O:AP - Russian range and it's spelling variations depending on the part of speech.
Not all words have a variation for a different part of speech so it would have a blank in that spot

IF there is a value in the English range, there will not be a blank gap between English cells, but may have blanks in the right cells.
IF there are values in the Russian range, there may be a blank gap in any of the cells.
 
Upvote 0

Forum statistics

Threads
1,223,237
Messages
6,170,928
Members
452,366
Latest member
TePunaBloke

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top