Help identify mystery character in SDLXLIFF file (for MS Word)
Thread poster: Samuel Murray

Samuel Murray
Netherlands
Local time: 06:50
Member (2006)
English to Afrikaans
Aug 20, 2015

Hello everyone

Can anyone please tell me what the character between the two brackets is in this file:
http://wikisend.com/download/340442/mystery%20character.zip
and how I can type that character in MS Word and use it in a find/replace action there?

Thanks


 

Elif Baykara Narbay
Turkey
Local time: 08:50
German to Turkish
A grey square in a box? Aug 20, 2015

I have downloaded the file and this is what I get.

I tried to match it in some fonts, such as Webdings, without success. It may be another substitute glyph for characters that cannot be shown in a certain font, just like the empty box.


 

Samuel Murray
Netherlands
Local time: 06:50
Member (2006)
English to Afrikaans
THREAD STARTER
EF BB BF Aug 20, 2015

Samuel Murray wrote:
Can anyone please tell me what is the character between the two brackets in this file...?


My hex editor tells me the character is EF BB BF, which, incidentally, is the same as the character at the start of the file, i.e. a UTF8 byte order mark. However, in my SDLXLIFF file, this character occurs in places where I might expect bullets in a bullet list.

I can see the character in MS Word, but I can't copy it to the clipboard, so I'm going to have to learn how to type it in the find/replace box to be able to manipulate it.


 

Joakim Braun
Sweden
Local time: 06:50
German to Swedish
Would this help Aug 20, 2015

http://wordribbon.tips.net/T009167_Searching_for_Multi-Byte_Hex_Codes.html

 

Samuel Murray
Netherlands
Local time: 06:50
Member (2006)
English to Afrikaans
THREAD STARTER
@Joakim Aug 20, 2015

Joakim Braun wrote:
Would this help?


Nope, I've tried that already, sorry.

One method is to convert the hex code to decimal (e.g. using an online converter) and then type ^u0000 (where 0000 is the decimal code) in the search box. However, the decimal code for BB BF is 48063, and for EF BB BF it is 15711167, and neither of these codes finds the mystery character.

The other method is to type the hex code and then press Alt+X. This method only works if the hex code has four letters/numbers, not more. The Alt+X conversion of BB BF is 뮿, which is not my character either.
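
A likely reason neither number works is that EF BB BF is a UTF-8 byte sequence rather than a character code, so reading the bytes as one plain number gives a value that does not correspond to the character. The following is only a minimal sketch in Python (not a tool anyone in this thread used) that reproduces the arithmetic above and then decodes the bytes as UTF-8 instead:

```python
# Reading the hex digits as one plain number, as an online converter does.
print(int("BBBF", 16))      # 48063
print(int("EFBBBF", 16))    # 15711167

# Alt+X treats four hex digits as a code point, so BBBF becomes U+BBBF.
print(chr(0xBBBF))

# Decoding the three bytes as UTF-8 yields a single character instead.
decoded = bytes.fromhex("EFBBBF").decode("utf-8")
print(len(decoded), ord(decoded))   # 1 65279
```

The decoded value, 65279, is a different number from both 48063 and 15711167.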


 

Adrien Esparron
Local time: 06:50
Member (2007)
German to French
In my MS Word Aug 20, 2015

The mystery character looks like this:

><

I can copy it and replace it with whatever I want (a single character or a word).

For that I use the default Windows encoding.

Hope this helps!

Regards


 

Samuel Murray
Netherlands
Local time: 06:50
Member (2006)
English to Afrikaans
THREAD STARTER
@Adrien Aug 20, 2015

Adrien Esparron wrote:
The mystery character looks like this:
><
For that I use the default Windows encoding.


The encoding of the text file is UTF-8 with BOM (sorry, perhaps I should have mentioned it, but MS Word is usually pretty good at guessing text files' encoding and I had thought that all installations of MS Word would successfully identify the file as UTF-8 with BOM).

If you open a text file that is encoded in one encoding as if it is encoded in another encoding (i.e., what you have done), then different characters will be displayed. If you want to see how this character is displayed in MS Word, then don't select "Windows (Default)" as the encoding, but "Other encoding: Unicode (UTF8)" when opening the file in MS Word.
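
The effect of opening the file with the wrong encoding can be sketched in Python. This is only an illustration, and it assumes that "Windows (Default)" means code page 1252, as on a Western European system; other systems may use a different default code page.

```python
bom_bytes = bytes.fromhex("EFBBBF")

as_utf8 = bom_bytes.decode("utf-8")      # read as UTF-8: one character
as_cp1252 = bom_bytes.decode("cp1252")   # read as Windows-1252 (assumed default)

print(len(as_utf8), [f"U+{ord(c):04X}" for c in as_utf8])   # 1 ['U+FEFF']
print(len(as_cp1252), as_cp1252)                            # 3 ï»¿
```

The same three bytes become one invisible character under UTF-8 and three visible characters under code page 1252, which is why the file looks different depending on the encoding chosen when opening it.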


 

Joakim Braun
Sweden
Local time: 06:50
German to Swedish
Reversed byte order? Aug 20, 2015

Samuel Murray wrote:

The Alt+X conversion of BB BF is 뮿, which is not my character either.




And BF BB?
Reversed byte order - worth a try.


 

Robin Levey
Chile
Local time: 02:50
Spanish to English
Zero-Width No-Break Space Aug 20, 2015

In UTF-8, EF BB BF is a zero-width no-break space (see: http://www.fileformat.info/info/charset/UTF-8/list.htm?start=43024 )

In Word, the equivalent character is called “No-Space Non Break” and on my system (Word 2000 / Win XP*) it can be inserted into a document via the “Insert Symbol” dialogue, “Special Characters” tab, last item in the list. It displays differently from what we see in Samuel’s link, and it has a different hex code: E2 80 8D (again, 3 hex bytes …).

After assigning a key code to this NSNB character (it doesn’t have one by default) I can insert it into a document and replicate something very similar to Samuel’s problem. In contrast to other special characters (e.g. ©) I cannot insert this character directly into the “Find” box using the assigned shortcut, nor can I copy-paste it from the document as a single character. However, if I know, for example, that it is always preceded by a ‘p’ and always followed by a ‘q’, I can search for ‘p?q’ (copy-pasted as a 3-character group) and it finds that combination correctly, including the NSNB represented by the ? wildcard for one required character.

Samuel has said that his mystery character appears in places where he might expect to find a bullet (and there's indeed some typographical logic in the use of this special character in that situation), so maybe there’s a fixed pattern, similar to the one I’ve used above, that he can exploit to do the search. IOW, if Word accepts that the ? wildcard can find Word's E2 80 8D, maybe it will also find Samuel's EF BB BF.

* Other combinations of Word and Windows may give different (or zero) mileage.
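
To compare the two byte sequences mentioned above, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for this illustration, and the names printed depend on the Unicode data shipped with the Python version in use.

```python
import unicodedata

def describe(utf8_hex: str) -> tuple[str, str]:
    """Decode one character from UTF-8 hex bytes; return (code point, Unicode name)."""
    ch = bytes.fromhex(utf8_hex).decode("utf-8")
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

print(describe("E2 80 8D"))   # ('U+200D', 'ZERO WIDTH JOINER')
print(describe("EF BB BF"))   # ('U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')
```

So the two sequences are different code points: E2 80 8D is U+200D, while EF BB BF is U+FEFF.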

HTH
RL


 

Dan Lucas
United Kingdom
Local time: 05:50
Member (2014)
Japanese to English
Zero width no-break space? Aug 20, 2015

Samuel Murray wrote:
Can anyone please tell me what is the character between the two brackets in this file:

Emacs thinks it's a ZERO WIDTH NO-BREAK SPACE, as per the following dump:

position: 3 of 3 (67%), column: 0
character:  (displayed as ) (codepoint 65279, #o177377, #xfeff)
preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xFEFF
script: arabic
syntax: w which means: word
to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
buffer code: #xEF #xBB #xBF
file code: not encodable by coding system iso-latin-1-dos
display: no font available

Character code properties: customize what to show
name: ZERO WIDTH NO-BREAK SPACE
old-name: BYTE ORDER MARK
general-category: Cf (Other, Format)
decomposition: (65279) ('')

As for find and replace, I think ^uxxxx is how to find Unicode characters in MS Word. Perhaps ^u65279 is worth trying.
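
For readers without Emacs, the key figures in the dump can be cross-checked with a few lines of Python. This is only a sketch using the standard library; it has nothing to do with how Emacs produces its output.

```python
import unicodedata

ch = "\ufeff"

# The same code point in decimal, octal and hex, as in the dump.
print(ord(ch), oct(ord(ch)), hex(ord(ch)))   # 65279 0o177377 0xfeff

# Unicode name and general category.
print(unicodedata.name(ch), unicodedata.category(ch))

# UTF-8 bytes, as a hex editor would show them (bytes.hex with a separator needs Python 3.8+).
print(ch.encode("utf-8").hex(" ").upper())   # EF BB BF
```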

Regards
Dan


 

Samuel Murray
Netherlands
Local time: 06:50
Member (2006)
English to Afrikaans
THREAD STARTER
Dan licked it (one solution) Aug 20, 2015

Dan Lucas wrote:
Emacs thinks...
... decomposition: (65279) ('')
Perhaps ^u65279 is worth trying.


^u65279 in the Find field works. Thanks, Dan.

So...

To repeat this solution with other characters, one has to either use Emacs, or... reduce a copy of the file to that character only (with known characters on either side of it) and save it as UTF8 plain text, then open it in a hex editor (such as Geoffrey Prewett's 150 KB one), take note of the hex code (in my case EFBBBF), and then find the corresponding HTML entity hex code (in my case #xfeff) and HTML entity decimal code (in my case 65279). One can do this here:

http://www.google.com/search?q=site:.fileformat.info/info/unicode/char/%20efbbbf (for "efbbbf")

To type this character in MS Word, type the HTML entity hex code and press Alt+X (i.e. type FEFF and press Alt+X). To find this character in the find/replace dialog, and presumably also in a macro, use the HTML entity decimal code preceded by "^u" (i.e. ^u65279).
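
The lookup step can also be done without a website. Below is a minimal Python sketch that takes the UTF-8 bytes noted in the hex editor and returns both codes described above. The function name `word_codes` is made up for this illustration, and it has only been reasoned through for characters with four-digit hex codes; anything above FFFF is untested in Word.

```python
def word_codes(utf8_hex: str) -> tuple[str, str]:
    """Return (Alt+X code, Find-box code) for one character given as UTF-8 hex bytes."""
    ch = bytes.fromhex(utf8_hex).decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected the bytes of exactly one character")
    code_point = ord(ch)
    # Hex form for typing followed by Alt+X; decimal form after ^u for the Find box.
    return f"{code_point:04X}", f"^u{code_point}"

print(word_codes("EFBBBF"))   # ('FEFF', '^u65279')
```

Spaces between the byte pairs are accepted too, so `word_codes("EF BB BF")` gives the same result.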


[Edited at 2015-08-21 07:42 GMT]


 

Stepan Konev
Russian Federation
Local time: 08:50
English to Russian
Select and Ctrl+H Aug 20, 2015

Samuel Murray wrote:
To repeat this solution with other characters


To find such characters without a code, select one and press Ctrl+H in MS Word. Do not try to copy/paste it; that does not work. Just select it and press Ctrl+H. The fields in the Replace dialog appear empty, but when you click Replace All, all such characters disappear.

I see a similar sign in some places within text translated by qTranslate+Google, but I have no idea why this happens...


 

Samuel Murray
Netherlands
Local time: 06:50
Member (2006)
English to Afrikaans
THREAD STARTER
@Stepan (another, simpler solution) Aug 21, 2015

Stepan Konev wrote:
To find such characters without a code, select one and press Ctrl+H in MS Word. ... The fields in the Replace dialog appear empty, but when you click Replace All, all such characters disappear.


Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H.

This gave me an idea, though, which works: to find the HTML entity decimal code for the mystery character, simply record a macro with it. In other words, start recording the macro, then select the character, press Ctrl+H, replace it with anything, and stop recording the macro. Then step into the macro, and you'll see the HTML entity decimal code for it: in my case, ChrW(65279).

==

By the way, those Google Translate characters in your screenshot (which I suspect are inserted by Google to help them identify machine-translated text while they crawl the web for translations) I simply remove using a macro:

Sub gt_removechars()
    ' Remove every zero-width space (U+200B, decimal 8203) from the active document.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Execute FindText:=ChrW(8203), ReplaceWith:="", _
            Replace:=wdReplaceAll
    End With
End Sub
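
Outside Word, the same clean-up can be sketched for plain text in Python. This is only an illustration under the assumption that the text is already loaded as a string; the names `ZERO_WIDTH` and `strip_zero_width` are invented here, and U+FEFF is included only because it is the character discussed earlier in this thread.

```python
# Code points to delete: zero-width space (8203) and zero-width no-break space (65279).
ZERO_WIDTH = {0x200B: None, 0xFEFF: None}

def strip_zero_width(text: str) -> str:
    """Return text with the code points listed in ZERO_WIDTH removed."""
    return text.translate(ZERO_WIDTH)

sample = "machine\u200btranslated\ufeff text"
print(strip_zero_width(sample))   # machinetranslated text
```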

Samuel


 

Dan Lucas
United Kingdom
Local time: 05:50
Member (2014)
Japanese to English
Hah, useful Aug 21, 2015

Samuel Murray wrote:
Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H.

I wasn't consciously aware of that either! Since Ctrl+H is a common shortcut for find and replace (SDL Studio, Notepad++ and many more), I must have used it in Word many times without really thinking about it. Thank you to Stepan for explicitly pointing it out.

Regards
Dan


 

Elizabeth Joy Pitt de Morales
Local time: 06:50
Member (2007)
Spanish to English
Thanks! Aug 21, 2015

Stepan Konev wrote:

To find such characters without a code, select one and press Ctrl+H in MS Word. Do not try to copy/paste it; that does not work. Just select it and press Ctrl+H. The fields in the Replace dialog appear empty, but when you click Replace All, all such characters disappear.



This is extremely valuable information. Thank you!


 

