Searching for degree symbol in LUA string variable


Post by heyvern »

I am updating my "Simple Stroke Text" script to add some missing characters.
I am trying to search a string for the degree symbol ° (hope that displays).
There are several ways to type this character: Alt+0176 or Alt+248. Its Unicode code point is U+00B0.

So far, searching a text variable for the literal degree symbol returns zilch in the script.
The odd thing is that it will PRINT to the console perfectly fine, but ONLY if it's typed directly in the script, like this: °

Code:

print("°")

I've tried every way I can find in Lua to search for it, including:

Code:

string.char(248) -- returns nada, zilch
string.byte("°") -- same here nothing nada zilch

So basically, what a string search finds for this character is what I think is the UTF-8 code for a non-breaking space on Windows, but only the first half, because the code contains a space: 194 160. Lua, of course, doesn't really do UTF-8.

UPDATE
The UTF-8 encoding of the degree symbol is 194 176, so the search is "correct-ish". However, since Lua doesn't understand UTF-8, I still need a way to convert it.

Code:

if (string.byte(txt) == 194) then ...

This works and returns true. However, 194 is also the first half of the non-breaking space's UTF-8 code, so it doesn't identify the degree symbol on its own. There doesn't appear to be a way to search for this character, even though Moho's scripting interface and Lua can see it and print it in the script.
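
For reference, a minimal sketch of a byte-level search. It assumes the text really is UTF-8 (as the 194 176 result suggests); the `DEGREE` constant and the sample string are made up for illustration. A plain `string.find` matches the two-byte sequence as-is, so no conversion step should be needed:

```lua
-- Assumption: txt holds UTF-8 text. "\194\176" is the two-byte UTF-8
-- encoding of the degree symbol, written with decimal escapes so the
-- result does not depend on how the script file itself is saved.
local DEGREE = "\194\176"
local txt = "25" .. DEGREE .. "C"

-- The fourth argument (true) turns off pattern matching: plain byte search.
local first, last = string.find(txt, DEGREE, 1, true)
print(first, last)                --> 3  4

-- string.byte with a range returns every byte, not just the first one.
print(string.byte(txt, 1, -1))    --> 50  53  194  176  67
```

Called with a single argument, `string.byte(txt)` only reports the first byte, which is why a test against 194 alone cannot tell the degree symbol from other two-byte characters.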

Any help is appreciated. I could use substitute values as a placeholder but I'm checking each individual letter to draw them to vectors.

Post by heyvern »

Okay, well, I'll update here for anyone else with similar issues with "special" characters in Lua, but I think it's pretty much solved. It always happens like this: minutes after posting, I figure it out... well... mostly.

So yes, I was right: the degree symbol returns the UTF-8 value "194 176". That mucks up the whole dang thing, kind of sort of, and it's confusing.

With the txt variable containing what is supposed to be a single letter from the input box string:

Code:

print(string.byte(txt))

When there is a degree symbol, the variable gets "split", because the result is two numbers separated by a space.

For example:
The string of text being searched letter by letter is "f ° g"
These are the string.byte() values returned
102 -- f
32 -- space
194 -- first half of degree symbol
176 -- second half of degree symbol
32 -- space
103 -- g

The "space" between the two values of the degree symbol is ignored for some reason.
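
A small sketch that reproduces the list above, assuming UTF-8 input (the escape `"\194\176"` stands in for the degree symbol). The 194 and 176 are two adjacent bytes of one character, so nothing sits between them; the only spaces in the string are the two 32s:

```lua
-- Assumption: the input is UTF-8; "\194\176" stands in for the degree symbol.
local txt = "f \194\176 g"

-- Collect every byte value. #txt counts bytes, so the loop runs 6 times
-- even though the text shows only 5 characters.
local bytes = {}
for i = 1, #txt do
    bytes[#bytes + 1] = string.byte(txt, i)
end
print(table.concat(bytes, " "))   --> 102 32 194 176 32 103
```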

There will probably be a bunch of "odd" characters like this, so I am going to come up with some sort of workaround that matches groups of those values. The byte values only go up so far for each "standard" single-value set. For example, A-Z is 65-90 and a-z is 97-122. The task is finding the odd whatchamacallits and matching them specifically by their UTF-8 values.

Post by synthsin75 »

Would something like this give you a way to search the code of each character?

Code:

	-- Decode a UTF-8 string into a table of Unicode code points (uses Lua 5.2's bit32).
	local utf8str = "🔒"
	function Utf8to32(utf8str)
		assert(type(utf8str) == "string")
		local res, seq, val = {}, 0, nil  -- seq: bytes still expected for the current character
		for i = 1, #utf8str do
			local c = string.byte(utf8str, i)
			if seq == 0 then
				-- Lead byte: store the previous code point (nil on the first pass, which adds nothing)
				table.insert(res, val)
				seq = c < 0x80 and 1 or c < 0xE0 and 2 or c < 0xF0 and 3 or
					c < 0xF8 and 4 or --c < 0xFC and 5 or c < 0xFE and 6 or
					error("invalid UTF-8 character sequence")
				val = bit32.band(c, 2^(8-seq) - 1)  -- keep the payload bits of the lead byte
			else
				-- Continuation byte: shift in its low 6 bits
				val = bit32.bor(bit32.lshift(val, 6), bit32.band(c, 0x3F))
			end
			seq = seq - 1
		end
		table.insert(res, val)
		table.insert(res, 0)  -- trailing 0 terminator
		return res
	end
	local res = Utf8to32(utf8str)
	for i, v in ipairs(res) do
		print(i, "  ", v)
	end

The result gives the position and code for each character in the string. You'd just have to search this table, instead of the string directly.

If I understand what you're needing.

Post by hayasidist »

UTF-8 is a real headache right now. One UTF-8 character can use anything from 1 to 4 bytes. It is clearly possible (as Wes has exemplified) to detect UTF-8 characters in a string: if the byte you inspect has its top two bits set, it starts a multi-byte character. 110xxxxx starts a 2-byte character; 1110xxxx, 3 bytes; 11110xxx, 4 bytes. Each additional byte should have its top two bits set to 10, i.e. each follows the pattern 10xxxxxx.
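
The bit patterns above can be turned into a small classifier. This is only a sketch; `utf8_seq_len` is a made-up helper name, not part of Lua or Moho:

```lua
-- Hypothetical helper: number of bytes in the UTF-8 sequence that starts
-- with lead byte b, or nil if b cannot start a sequence.
local function utf8_seq_len(b)
    if b < 0x80 then return 1 end      -- 0xxxxxxx: plain ASCII
    if b < 0xC0 then return nil end    -- 10xxxxxx: continuation byte
    if b < 0xE0 then return 2 end      -- 110xxxxx
    if b < 0xF0 then return 3 end      -- 1110xxxx
    if b < 0xF8 then return 4 end      -- 11110xxx
    return nil                         -- not valid in UTF-8
end

print(utf8_seq_len(102))   --> 1    (f)
print(utf8_seq_len(194))   --> 2    (lead byte of the degree symbol)
print(utf8_seq_len(176))   --> nil  (continuation byte)
print(utf8_seq_len(226))   --> 3    (lead byte of the euro sign)
```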

And as Vern has noticed, #string counts bytes, not characters, so (e.g.) if we have s = "€€€", #s is 9, and the byte sequence for each symbol is 0xE2 0x82 0xAC (binary 11100010, 10000010, 10101100).
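
A rough sketch of counting characters instead of bytes, assuming the string is valid UTF-8: count every byte that is not a continuation byte (10xxxxxx, i.e. 128-191). `utf8_char_count` is a hypothetical helper name:

```lua
-- Assumption: s is valid UTF-8. "\226\130\172" is the euro sign (0xE2 0x82 0xAC).
local s = string.rep("\226\130\172", 3)

-- Count characters by counting every byte that is NOT a continuation byte.
local function utf8_char_count(str)
    -- gsub's second return value is the number of matches it replaced.
    local _, count = string.gsub(str, "[^\128-\191]", "")
    return count
end

print(#s)                   --> 9
print(utf8_char_count(s))   --> 3
```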

A look at the Lua 5.3 manual offers some help, most notably the string pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", which it says can be used in string.find etc. (I haven't tried it, but the literal string, unlike the 5.3 functions and constants, should be perfectly usable in 5.2.)

enjoy!

Post by heyvern »

synthsin75 wrote: Sat Jun 24, 2023 5:35 am Would something like this give you a way to search the code of each character?....
Holy cow cool beans! Yes... I don't... fully understand it but the results seem to work. I only just tested it really fast with that dagnab stupid degree symbol and it works great. It returns 176, which is the Unicode code point for the degree symbol (U+00B0), not an ASCII code. It should be a simple matter of sending wonky "letters" into the function to get a result I can use.

Thanks, guys for the feedback.

p.s. So it's weird. Some of the oddball characters work fine without doing anything fancy, but some just aren't recognized. So this solution should work for those oddball weird ones. For example, the dagnab degree symbol is all kinds of wonky but a crazy double dagger ‡ has no issues at all. What the freaking heck?

p.p.s. Nope, wrong. Double dagger is also wonky but not likely to end up in my script anyway.

Post by heyvern »

Well, don't I feel a tad foolish.

So apparently I was making this harder by using string.len(). It is very useful, but it gives the length of a string in BYTES, not characters. So weird characters that have their bytes in groups, like bananas, would mess up the for loops. The only reason it worked relatively okay was that I was mostly using plain vanilla ASCII text. As soon as I added in funky stuff it... broke.

Easy peasy fix though. I simply had to base the length of my for loop on filtering for those nasty unruly characters that were giving me a hard time. I started to think of these symbols like juvenile delinquents hanging out at the corner smoking and drinking, yelling at people as they walk by. Real troublemakers.

This is the code I'm using now and it seems to work great for everything. AND the BIG BONUS is I can use the actual symbol in the script to test for it. Works a treat.

Code:

for character in string.gmatch(simpleText, "([%z\1-\127\194-\244][\128-\191]*)") do
     --print(character)
     textTable[ct] = character
     ct=ct+1
end
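
For anyone who wants to try this outside Moho, here is a self-contained sketch of the same loop. It assumes UTF-8 input; the sample text and the `DEGREE` constant are made up for the test, and the NUL byte is left out of the character class so the pattern doesn't rely on `%z`:

```lua
-- Assumptions: simpleText is UTF-8; DEGREE is written with decimal escapes
-- so the comparison does not depend on the script file's encoding.
local DEGREE = "\194\176"
local simpleText = "f " .. DEGREE .. " g"

-- One lead byte (ASCII 1-127 or 194-244) followed by any continuation bytes.
local textTable = {}
for character in string.gmatch(simpleText, "[\1-\127\194-\244][\128-\191]*") do
    textTable[#textTable + 1] = character
end

print(#simpleText, #textTable)   --> 6  5
print(textTable[3] == DEGREE)    --> true
```

Each table entry is a whole character, so the for loop can run over `#textTable` instead of the byte length of the string.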