Problems Creating a Chinese Corpus (Can I SEARCH() a Range?)

corpusworker

New Member
Joined
Nov 18, 2008
Messages
19
Hello all. I dig a lot of these problem/solution things, but did not see one that matched my search criteria, and so am turning to experts for some advice.

To start, I'm only moderately skilled with Excel, and I work primarily with the logical, not mathematical functions. That being said, I'd like to know if there is a good method for Excel to solve this problem. I've been working at it in a rather sloppy manner until I began researching some other approaches.

The Problem(s):

I am developing a site that teaches the grammar of modern Chinese at the same time that it builds vocabulary by frequency. There are a couple of issues with this:

  1. Chinese word corpi are virtually nonexistent, and where existent, sloppy.
  2. Chinese character corpi are more available and somewhat more reliable.
The difference between a word and a character in Chinese can be a bit hazy, but the easiest way to put it is that not all characters stand alone as word and that most words are two to three characters in length.

Therefore, using data from the character corpus and a few thorough dictionaries, I created a word list in Excel that has the words and definitions in separate columns.

I have likewise found a character corpus that I have appropriately organized by ranking and other relevant data. The characters occupy a column.

Here's the basic logical function I am trying to create in separate columns, which are organized by frequency of the character:

=IF(AND(OR(contains any characters in the character corpus from frequency x to frequency y),NOT(OR(contains any characters in the character corpus from frequency y+1 to z)),Word Column,"")

Now, I know now that I have a good workaround, so I can shorten this to...

=IF(NOT(OR(contains any characters in the character corpus from frequency y+1 to z)),Word Column,"")

The problem is that I don't see any effective means of efficiently coding the OR condition!

Is there some magic that can be done so that I can use the ISNUMBER(SEARCH(characters,word cell) function for multiple characters?

I definitely don't want to go...
=IF(NOT(OR(ISNUMBER(SEARCH("球",word cell)),ISNUMBER(SEARCH("式",same word cell),...),Word Column,"") for thousands upon thousands of characters!

Is there a way for me to select a range of values within the SEARCH() function to allow this kind of programming?

I also had a look at the filters, but found myself facing the same problem.

I don't want to abandon hope, but I also don't want to spend twenty years on what should be a fifty-day project.

Thanks if you can help!
 

Excel Facts

Remove leading & trailing spaces
Save as CSV to remove all leading and trailing spaces. It is faster than using TRIM().
Hi,

I'm not crystal clear on the requirement, however it seems this might be a good case for using some VBA and creating a 'user defined function' (UDF). By using programming, just about anything can be done.

It would help if you post some sample data, as this should greatly assist with understanding.

regards, Fazza
 
Upvote 0
Sure. I'll give a basic example:

Here is a list of words of the Chinese language, which I put in Column A:

<table x:str="" style="border-collapse: collapse; width: 106pt;" width="141" border="0" cellpadding="0" cellspacing="0"><col style="width: 106pt;" width="141"> <tbody><tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幸好 " width="141" height="22">幸好 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="能幹 " width="141" height="22">能幹 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="好幾 " width="141" height="22">好幾 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾個 " width="141" height="22">幾個 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾乎沒有 " width="141" height="22">幾乎沒有 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾何學 " width="141" height="22">幾何學 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾天來 " width="141" height="22">幾天來 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾個小時 " width="141" height="22">幾個小時 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="好幾年 " width="141" height="22">好幾年 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾年來 " width="141" height="22">幾年來 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="有序 " width="141" height="22">有序 </td> </tr> </tbody></table>
Now, let's say I want to filter out all of the words except for the ones that have the characters 好, 幾, 年, and 來.

The problem is that I have to work by elimination, not by addition, and so IF(NOT(OR(跑,工,忙,相,光...),"",A).

The list of characters, however, is more than 2500 big, so I can't just write, "ISNUMBER(SEARCH(工,A3))," and so on 2496 times just to isolate the few that I want. I want to know if there is a way around it. I have put the character list on a spreadsheet, as well, and want to know if I can just refer to an array, like if there's a "NOT(ANY(D5:D2550)" function to use, or if there's a shortcut I'm missing, or something of that sort.
 
Upvote 0
Hi,

The sample has a list of words in column A.

You write you want to filter out records that don't contain some (listed) characters. This list of characters can be large.

To me, this would result in a dataset of words that don't contain the listed characters. Though the example of "=IF(not(...), etc" suggests only a TRUE/FALSE indication is required. I don't understand.

With a simple example, what would the result be?

F
 
Upvote 0
Right, I want to disqualify words that are too rare for beginning lessons, and then gradually disqualify fewer and fewer until one has reached the full corpus of the word set.

For the data set I provided, removing all of the character sets except for the four mention would get from...

Column A

<table x:str="" style="border-collapse: collapse; width: 106pt;" width="141" border="0" cellpadding="0" cellspacing="0"><tbody><tr style="height: 16.5pt;" height="22"><td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幸好 " width="141" height="22">幸好 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="能幹 " width="141" height="22">能幹 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="好幾 " width="141" height="22">好幾 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾個 " width="141" height="22">幾個 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾乎沒有 " width="141" height="22">幾乎沒有 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾何學 " width="141" height="22">幾何學 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾天來 " width="141" height="22">幾天來 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾個小時 " width="141" height="22">幾個小時 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="好幾年 " width="141" height="22">好幾年 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="幾年來 " width="141" height="22">幾年來 </td> </tr> <tr style="height: 16.5pt;" height="22"> <td class="xl24" style="height: 16.5pt; width: 106pt;" x:str="有序 " width="141" height="22">有序 </td></tr></tbody></table>
..by eliminating/disqualifying entries with any character except for 好, 幾, 年, or 來, to...

Column B

好幾
好幾年
幾年來


Notice that I made Column B remove words that had more than the characters that I allowed.

Sorry, I was using shorthand and mixed the consequences.

It should read, "=IF(NOT(OR(跑,工,忙,相,光...),A3,"")." I wrote the Chinese characters in place of "ISNUMBER(SEARCH(跑,A3))" to save space, so in full the characters would have to be, =IF(NOT(OR(ISNUMBER(SEARCH(跑,A3)),ISNUMBER(SEARCH(工,A3)),ISNUMBER(SEARCH(忙,A3)),ISNUMBER(SEARCH(相,A3)),ISNUMBER(SEARCH(光,A3))...),A3,"")

The problem is exactly this. Even with only six characters, the equation is ridiculously big, and I don't think I could even fit 2500 on the grid without causing an error.
 
Upvote 0
Hi,

The code below assumes the words are in the active sheet from cell A1 downwards, characters from C1 downwards and results in E1 downwards, in a contiguous block for each. Column "E" (5) is erased before putting the results to the worksheet. Please modify to suit different set up.

ALT-F11 from the data worksheet to the code window. Copy & paste the code into the big window. Then position the cursor somewhere within the code and hit F5 to execute the code. Or, from in Excel, ALT-F8 and run it that way.

HTH, Fazza

Code:
Sub test()
 
  Dim bAllCharactersFound As Boolean
  Dim i As Long, j As Long, k As Long
  Dim strTemp As String
  Dim arWords() As Variant
  Dim arCharacters() As Variant
  Dim arResults() As Variant
  Dim dicChar As Object
 
  With Range("A1").CurrentRegion
    ReDim arWords(1 To .Rows.Count, 1 To 1)
    ReDim arResults(1 To .Rows.Count, 1 To 1)
    arWords = .Value2
  End With
 
  With Range("C1").CurrentRegion
    ReDim arCharacters(1 To .Rows.Count, 1 To 1)
    arCharacters = .Value2
  End With
 
  'Load characters into a collection
  Set dicChar = CreateObject("Scripting.Dictionary")
  For i = LBound(arCharacters, 1) To UBound(arCharacters, 1)
    If Not dicChar.exists(arCharacters(i, 1)) Then dicChar.Add arCharacters(i, 1), Nothing
  Next i
 
  'Loop through words
  For i = LBound(arWords, 1) To UBound(arWords, 1)
 
    strTemp = arWords(i, 1)
    bAllCharactersFound = True
 
    'for each word, loop through each character
    For j = 1 To Len(strTemp)
      If Not dicChar.exists(Mid$(strTemp, j, 1)) Then
        bAllCharactersFound = False
        Exit For
      End If
    Next j
 
    If bAllCharactersFound Then
      k = k + 1
      arResults(k, 1) = strTemp
    End If
 
  Next i
 
  'Erase old results before pasting these results, if there are any
  Columns(5).ClearContents
  If k Then Range("E1").Resize(k, 1).Value2 = arResults
 
End Sub
 
Upvote 0
This is quite awesome! I'll try it and see if it works. Are there any sites that will teach me the scripting language you used?
 
Upvote 0
I tried the code in a couple of different ways, even leaving it the same and inputting the data into the proper rows. My results only deleted everything in the column where the other data would have gone.

Anyway, thanks for trying. I'll find a way around this somehow.
 
Upvote 0
I don't understand your comments - trying different ways, inputting the data into the proper rows.

I don't know what different way, or why. As above, data is assumed to be in columns, not rows. It works for me for the set up I described and the data you posted. Code is necessarily specific to the set up. If you are testing with different code or different set up please advise what you're doing as I can't possibly help without knowing. I suggest this is worth pursuing as it does work AFAIK.

regards, Fazza
 
Upvote 0
Oh, I just used an experimental document to get it done. I put the words in column A starting at A1, the characters in column C starting at C1, and left E1 blank.

It does erase contents at E1, but it does not put another block in its place.

I think I'll put a picture of my spreadsheet here to give some further information of the type of layout with which I am working.
 
Upvote 0

Forum statistics

Threads
1,223,214
Messages
6,170,771
Members
452,353
Latest member
strainu

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top