Help identify mystery character in SDLXLIFF file (for MS Word) Thread poster: Samuel Murray
Samuel Murray Netherlands Local time: 12:02 Member (2006) English to Afrikaans + ... |
A grey square in a box? | Aug 20, 2015 |
I have downloaded the file and this is what I get. I tried to match it in some fonts such as Webdings, without success. It may be another substitute for characters that cannot be shown in a certain font, just like the empty box.
Samuel Murray Netherlands Local time: 12:02 Member (2006) English to Afrikaans + ... TOPIC STARTER
Samuel Murray wrote: Can anyone please tell me what is the character between the two brackets in this file...?
My hex editor tells me the character is EF BB BF, which, incidentally, is the same as the character at the start of the file, i.e. a UTF8 byte order mark. However, in my SDLXLIFF file, this character occurs in places where I might expect bullets in a bullet list. I can see the character in MS Word, but I can't copy it to the clipboard, so I'm going to have to learn how to type it in the find/replace box to be able to manipulate it.
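The point that the same three bytes also open the file as its byte order mark can be checked with a short Python sketch. The sample bytes are made up for illustration and are not taken from the actual SDLXLIFF file. Python's standard `utf-8-sig` codec drops only a leading BOM, so the occurrence between the brackets survives as the character U+FEFF.

```python
# Hypothetical sample: a UTF-8 BOM, then brackets enclosing the same three bytes.
sample = b"\xef\xbb\xbf(\xef\xbb\xbf)"

# "utf-8-sig" strips a BOM only at the very start of the data.
text = sample.decode("utf-8-sig")

# The occurrence inside the brackets is kept as an ordinary (invisible) character.
inner = text[1]
print(len(text), hex(ord(inner)))
```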
Joakim Braun Sweden Local time: 12:02 German to Swedish + ...
Samuel Murray Netherlands Local time: 12:02 Member (2006) English to Afrikaans + ... TOPIC STARTER
Joakim Braun wrote: Would this help?
Nope, I've tried that already, sorry. One method is to convert the hex code to decimal (e.g. using an online converter) and then type ^u0000 (where 0000 is the decimal code) in the search box. However, the decimal code for BB BF is 48063, and for EF BB BF it is 15711167, and neither of these codes finds the mystery character. The other method is to type the hex code and then press Alt+X. This method only works if the hex code has four letters/numbers, not more. The Alt+X conversion of BB BF is 뮿, which is not my character either.
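A short Python sketch of why neither decimal value matched, assuming (as the replies further down confirm) that EF BB BF is the UTF-8 encoding of a single character. Reading the bytes as one hexadecimal number gives the figures quoted above, whereas decoding them as UTF-8 first gives a different number, the character's code point.

```python
raw = bytes.fromhex("EFBBBF")

# Reading the raw bytes as one hexadecimal number (what a hex-to-decimal converter does).
as_number_3 = int("EFBBBF", 16)   # all three bytes
as_number_2 = int("BBBF", 16)     # last two bytes only

# Decoding the bytes as UTF-8 yields one character; ord() gives its code point.
code_point = ord(raw.decode("utf-8"))

print(as_number_2, as_number_3, code_point)
```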
Platary (X) Local time: 12:02 German to French + ... In my Ms Word | Aug 20, 2015 |
The mystery character looks like this: >< I can copy it and replace it with whatever I want (a single character or a word). For that I use the default Windows encoding. Hope this helps! Regards
Samuel Murray Netherlands Local time: 12:02 Member (2006) English to Afrikaans + ... TOPIC STARTER
Adrien Esparron wrote: The mystery character looks like that: >< I use for that the Windows encoding by default.
The encoding of the text file is UTF8 with BOM (sorry, perhaps I should have mentioned it, but MS Word is usually pretty good at guessing text files' encoding and I had thought that all installations of MS Word would successfully identify the file as UTF8 with BOM). If you open a text file that is encoded in one encoding as if it were encoded in another (i.e., what you have done), then different characters will be displayed. If you want to see how this character is displayed in MS Word, then don't select "Windows (Default)" as the encoding, but "Other encoding: Unicode (UTF8)" when opening the file in MS Word.
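The effect of opening the file with the wrong encoding can be illustrated in Python. This is only a sketch: it assumes that "Windows (Default)" corresponds to code page 1252, which is typical for Western European systems but not guaranteed on every installation.

```python
raw = bytes.fromhex("EFBBBF")

# Assumption: "Windows (Default)" corresponds to code page 1252 on this system.
as_windows = raw.decode("cp1252")   # three visible characters, one per byte
as_utf8 = raw.decode("utf-8")       # one invisible character

print(repr(as_windows), len(as_windows))
print(repr(as_utf8), len(as_utf8))
```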
Joakim Braun Sweden Local time: 12:02 German to Swedish + ... Reversed byte order? | Aug 20, 2015 |
Samuel Murray wrote: The Alt+X conversion of BB BF is 뮿, which is not my character either.
And BF BB? Reversed byte order - worth a try.
Zero-Width No-Break Space | Aug 20, 2015 |
In UTF-8, EF BB BF is a zero-width no-break space (see: http://www.fileformat.info/info/charset/UTF-8/list.htm?start=43024 ).
In Word, the equivalent character is called “No-Space Non Break” and on my system (Word 2000 / Win XP*) it can be inserted into a document via the “Insert Symbol” dialogue, “Special Characters” tab, last item in the list. It displays differently to what we see in Samuel’s link, and it has a different hex code: E2 80 8D (again, 3 hex bytes …).
After assigning a key code to this NSNB character (it doesn’t have one by default) I can insert it into a document and replicate something very similar to Samuel’s problem. In contrast to other special characters (e.g. ©), I cannot insert this character directly into the “Find” box using the assigned shortcut, nor can I copy-paste it from the document as a single character. However, if I know, for example, that it is always preceded by a 'p' and always followed by a ‘q’, I can search for ‘p?q’ (copy-pasted as a 3-character group) and it finds that combination correctly, with the NSNB represented by the ? wildcard for one required character.
Samuel has said that his mystery character appears in places where he might expect to find a bullet (and there's indeed some typographical logic in the use of this special character in that situation), so maybe there’s a fixed pattern, similar to the one I’ve used above, that he can exploit to do the search. IOW, if Word accepts that the ? wildcard can find Word's E2 80 8D, maybe it will also find Samuel's EF BB BF.
* Other combinations of Word and Windows may give different (or zero) mileage.
HTH RL
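The wildcard idea can also be tried outside Word. A minimal Python sketch, using made-up three-character samples: the regular-expression wildcard `.` plays the role of Word's `?` and matches exactly one character of any kind, including invisible ones. It is an analogy only and says nothing about how Word's own search treats these characters.

```python
import re

feff = bytes.fromhex("EFBBBF").decode("utf-8")   # Samuel's character
word = bytes.fromhex("E2808D").decode("utf-8")   # the Word special character described above

for ch in (feff, word):
    sample = "p" + ch + "q"                  # hypothetical sample text
    found = re.fullmatch(r"p(.)q", sample)   # "." matches any single character
    print(hex(ord(found.group(1))))
```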
Dan Lucas United Kingdom Local time: 11:02 Member (2014) Japanese to English Zero width no-break space? | Aug 20, 2015 |
Samuel Murray wrote: Can anyone please tell me what is the character between the two brackets in this file:
Emacs thinks it's a ZERO WIDTH NO-BREAK SPACE, as per the following dump:
position: 3 of 3 (67%), column: 0
character: (displayed as ) (codepoint 65279, #o177377, #xfeff)
preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xFEFF
script: arabic
syntax: w which means: word
to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
buffer code: #xEF #xBB #xBF
file code: not encodable by coding system iso-latin-1-dos
display: no font available
Character code properties: customize what to show
name: ZERO WIDTH NO-BREAK SPACE
old-name: BYTE ORDER MARK
general-category: Cf (Other, Format)
decomposition: (65279) ('')
As for find and replace, I think ^uxxxx is how to find Unicode in MS Word. Perhaps ^u65279 is worth trying. Regards Dan
Samuel Murray Netherlands Local time: 12:02 Member (2006) English to Afrikaans + ... TOPIC STARTER Dan licked it (one solution) | Aug 20, 2015 |
Dan Lucas wrote: Emacs thinks... ... decomposition: (65279) ('') Perhaps ^u65279 is worth trying.
^u65279 in the Find field works. Thanks, Dan.
So... To repeat this solution with other characters, one has to either use Emacs, or reduce a copy of the file to that character only (with known characters on either side of it) and save it as UTF8 plain text, then open it in a hex editor (such as Geoffrey Prewett's 150 KB one), take note of the hex code (in my case EFBBBF), and then find the corresponding HTML entity hex code (in my case #xfeff) and HTML entity decimal code (in my case 65279). These two are simply the character's Unicode code point written in hex and in decimal. One can do this here: http://www.google.com/search?q=site:.fileformat.info/info/unicode/char/%20efbbbf (for "efbbbf")
To type this character in MS Word, type the HTML entity hex code and press Alt+X (i.e. type FEFF and press Alt+X). To find this character in the find/replace dialog, and presumably also in a macro, use the HTML entity decimal code preceded by "^u" (i.e. ^u65279).
[Edited at 2015-08-21 07:42 GMT]
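The lookup steps above can be sketched in Python as an alternative to the hex editor and web search. `word_codes` is a hypothetical helper name, not part of any library, and the sketch only covers what the thread verified: a single character such as U+FEFF. Whether ^u works for every other code point in Word is not checked here.

```python
import unicodedata

def word_codes(utf8_hex: str) -> dict:
    """Hypothetical helper: map UTF-8 bytes (as hex) to the codes used in MS Word."""
    char = bytes.fromhex(utf8_hex).decode("utf-8")
    if len(char) != 1:
        raise ValueError("expected the bytes of exactly one character")
    cp = ord(char)
    return {
        "name": unicodedata.name(char, "<unnamed>"),
        "alt_x": f"{cp:04X}",   # type this in Word, then press Alt+X
        "find": f"^u{cp}",      # use this in the Find field
    }

result = word_codes("EFBBBF")
print(result)
```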
Stepan Konev Russian Federation Local time: 13:02 English to Russian Select and Ctrl+H | Aug 20, 2015 |
Samuel Murray wrote: To repeat this solution with other characters
To find such characters without a code, select one and press Ctrl+H in MS Word. Do not try to copy/paste it; that is to no avail. Just select it and press Ctrl+H. The Replace fields appear empty, but when you click Replace All, all such characters disappear. I see a similar sign in some places within text translated by qTranslate+Google, but I have no idea why this happens...
Samuel Murray Netherlands Local time: 12:02 Member (2006) English to Afrikaans + ... TOPIC STARTER @Stepan (another, simpler solution) | Aug 21, 2015 |
Stepan Konev wrote: To find such characters without a code you need to select it and press Ctrl+H in MS Word. ... The Replace fields appear empty, but when you click Replace all, all such chars disappear.
Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H. This gave me an idea, though, which works: to find the HTML entity decimal code for the mystery character, simply record a macro with it. In other words, start recording the macro, then select the character, press Ctrl+H, replace it with anything, and stop recording the macro. Then step into the macro, and you'll see the HTML entity decimal code for it: in my case, ChrW(65279).
==
By the way, those Google Translate characters in your screenshot (which I suspect are inserted by Google to help them identify machine-translated text while they crawl the web for translations), I simply remove using a macro:
Sub gt_removechars()
    ' ChrW(8203) is U+200B, the zero-width space
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Execute FindText:=ChrW(8203), ReplaceWith:="", _
            Replace:=wdReplaceAll
    End With
End Sub
Samuel
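For text that is handled outside Word, the same clean-up can be sketched in Python. `strip_invisible` is a hypothetical helper and the sample string is made up; the two code points are the ones named in this thread (8203 and 65279).

```python
# Code points taken from the thread: 8203 (U+200B) and 65279 (U+FEFF).
INVISIBLE = {8203: None, 65279: None}

def strip_invisible(text: str) -> str:
    """Hypothetical helper: delete the listed invisible characters from text."""
    # str.translate removes every character whose code point maps to None.
    return text.translate(INVISIBLE)

cleaned = strip_invisible("machine\u200btranslated (\ufeff) text")
print(cleaned)
```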
Dan Lucas United Kingdom Local time: 11:02 Member (2014) Japanese to English
Samuel Murray wrote: Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H.
I wasn't consciously aware of that either! Since Ctrl+H is a common shortcut for find and replace (SDL Studio, Notepad++ and many more), I must have used it in Word many times without really thinking about it. Thank you to Stepan for explicitly pointing it out. Regards Dan
Stepan Konev wrote: To find such characters without a code you need to select it and press Ctrl+H in MS Word. Do not try to copy/paste it – to no avail. Just select and Ctrl+H. The Replace fields appear empty, but when you click Replace all, all such chars disappear.
This is extremely valuable information. Thank you!