Help identify mystery character in SDLXLIFF file (for MS Word) (General technical issues)


Thread poster: Samuel Murray

Samuel Murray

Netherlands
Local time: 12:02
Member (2006)
English to Afrikaans
+ ...

Aug 20, 2015

Hello everyone

Can anyone please tell me what is the character between the two brackets in this file:
http://wikisend.com/download/340442/mystery%20character.zip
and how I can type that character in MS Word, and how I can do a find/replace action with that character in MS Word?

Thanks

Elif Baykara Narbay

Türkiye
Local time: 13:02
German to Turkish
+ ...

A grey square in a box?

Aug 20, 2015

I have downloaded the file and this is what I get.

I tried to match it in some fonts such as Webdings. No success. It may be another substitute for characters that cannot be shown in a certain font, just like the empty box.

Samuel Murray

Netherlands
Local time: 12:02
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

EF BB BF

Aug 20, 2015

Samuel Murray wrote:
Can anyone please tell me what is the character between the two brackets in this file...?

My hex editor tells me the character is EF BB BF, which, incidentally, is the same as the character at the start of the file, i.e. a UTF8 byte order mark. However, in my SDLXLIFF file, this character occurs in places where I might expect bullets in a bullet list.

I can see the character in MS Word, but I can't copy it to the clipboard, so I'm going to have to learn how to type it in the find/replace box to be able to manipulate it.

Joakim Braun

Sweden
Local time: 12:02
German to Swedish
+ ...

Would this help

Aug 20, 2015

http://wordribbon.tips.net/T009167_Searching_for_Multi-Byte_Hex_Codes.html

Samuel Murray

Netherlands
Local time: 12:02
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

@Joakim

Aug 20, 2015

Joakim Braun wrote:
Would this help?

Nope, I've tried that already, sorry.

The one method is to convert the hex code to decimal (e.g. using an online converter) and then type ^u0000 (where 0000 is the decimal code) in the search box. However, the decimal code for BB BF is 48063, and for EF BB BF it is 15711167, and neither of these codes finds the mystery character.

The other method is to type the hex code and then press Alt+X. This method only works if the hex code has four letters/numbers, not more. The Alt+X conversion of BB BF is 뮿, which is not my character either.
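The figures above come from reading the raw UTF-8 bytes as one plain number. Word's ^u and Alt+X work with Unicode code points instead, so a decoding step is needed in between. A minimal Python sketch of the difference (the variable names are illustrative only):

```python
# The three bytes reported by the hex editor.
utf8_bytes = bytes.fromhex("EFBBBF")

# Reading the bytes as a plain hexadecimal number gives the values tried above.
naive_two_bytes = int("BBBF", 16)      # 48063
naive_three_bytes = int("EFBBBF", 16)  # 15711167
alt_x_result = chr(0xBBBF)             # the Hangul syllable that Alt+X produced

# Decoding the bytes as UTF-8 gives the actual character and its code point.
char = utf8_bytes.decode("utf-8")
code_point = ord(char)

print(naive_two_bytes, naive_three_bytes, alt_x_result)
print(len(char), code_point, hex(code_point))
```

The three bytes decode to a single character, and its code point is the number that Word's search syntax needs.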

Platary (X)
Local time: 12:02
German to French
+ ...

In my Ms Word

Aug 20, 2015

The mystery character looks like that:

ï»¿>ï»¿<

I can copy it and replace it with what I want (just a char or a word).

I use for that the Windows encoding by default.

Hope this helps!

Regards

Samuel Murray

Netherlands
Local time: 12:02
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

@Adrien

Aug 20, 2015

Adrien Esparron wrote:
The mystery character looks like that:
ï»¿>ï»¿<
I use for that the Windows encoding by default.

The encoding of the text file is UTF8 with BOM (sorry, perhaps I should have mentioned it, but MS Word is usually pretty good at guessing text files' encoding and I had thought that all installations of MS Word will successfully identify the file as UTF8 with BOM).

If you open a text file that is encoded in one encoding as if it is encoded in another encoding (i.e., what you have done), then different characters will be displayed. If you want to see how this character is displayed in MS Word, then don't select "Windows (Default)" as the encoding, but "Other encoding: Unicode (UTF8)" when opening the file in MS Word.
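The effect described here can be reproduced outside Word. The sketch below assumes that "Windows (Default)" means code page 1252 (the Western European default); on other Windows locales the stray characters would differ.

```python
# The same three bytes, interpreted under two different encodings.
bom_bytes = b"\xef\xbb\xbf"

as_cp1252 = bom_bytes.decode("cp1252")  # one character per byte
as_utf8 = bom_bytes.decode("utf-8")     # a single invisible character

print(as_cp1252)                    # the three visible characters quoted above
print(len(as_cp1252), len(as_utf8))
```

Under code page 1252 each byte becomes its own visible character, which is why the sequence can be copied and replaced there; under UTF-8 the three bytes collapse into one invisible character.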

Joakim Braun

Sweden
Local time: 12:02
German to Swedish
+ ...

Reversed byte order?

Aug 20, 2015

Samuel Murray wrote:

The Alt+X conversion of BB BF is 뮿, which is not my character either.

And BF BB?
Reversed byte order - worth a try.

Jennifer Levey

Chile
Local time: 06:02
Spanish to English
+ ...

Zero-Width No-Break Space

Aug 20, 2015

In UTF-8, EF BB BF is a zero-width no-break space (see: http://www.fileformat.info/info/charset/UTF-8/list.htm?start=43024 )

In Word, the equivalent character is called “No-Space Non Break” and on my system (Word 2000 / Win XP*) it can be inserted into a document via the “Insert Symbol” dialogue, “Special Characters” tab, last item in the list. It displays differently to what we see in Samuel’s link, and it has a different hex code: E2 80 8D (again, 3 hex bytes …).

After assigning a key code to this NSNB character (it doesn’t have one by default) I can insert it into a document and replicate something very similar to Samuel’s problem. In contrast to other special characters (eg. ©) I cannot insert this character directly into the “Find” box, using the assigned shortcut, nor can I copy-paste it from the document, as a single character. However, if I know, for example, that it is always preceded by a 'p' and always followed by a ‘q’ I can search for ‘p?q’ (copy-pasted as a 3-character group) and it finds that combination – including the NSNB represented by the ? wildcard for one required character – correctly.

Samuel has said that his mystery character appears in places where he might expect to find a bullet (and there's indeed some typographical logic in the use of this special character in that situation), so maybe there’s a fixed pattern, similar to the one I’ve used above, that he can exploit to do the search. IOW, if Word accepts that the ? wildcard can find Word's E2 80 8D, maybe it will also find Samuel's EF BB BF.

* Other combinations of Word and Windows may give different (or zero) mileage.
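To check whether the two byte sequences mentioned above are the same character, one can decode each and look up its Unicode name. A small sketch using Python's standard unicodedata module (the helper name describe is made up for this example):

```python
import unicodedata


def describe(utf8_hex: str) -> str:
    """Decode a hex string of UTF-8 bytes and describe the one character it holds."""
    char = bytes.fromhex(utf8_hex).decode("utf-8")
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"


print(describe("EF BB BF"))  # the character in Samuel's file
print(describe("E2 80 8D"))  # the character inserted via Word's dialogue
```

The two sequences decode to different code points with different names, so a search for one will not match the other unless a wildcard is used.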

HTH
RL

Dan Lucas

United Kingdom
Local time: 11:02
Member (2014)
Japanese to English

Zero width no-break space?

Aug 20, 2015

Samuel Murray wrote:
Can anyone please tell me what is the character between the two brackets in this file:

Emacs thinks it's a ZERO WIDTH NO-BREAK SPACE, as per following dump:

position: 3 of 3 (67%), column: 0
character: (displayed as ) (codepoint 65279, #o177377, #xfeff)
preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xFEFF
script: arabic
syntax: w which means: word
to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
buffer code: #xEF #xBB #xBF
file code: not encodable by coding system iso-latin-1-dos
display: no font available

Character code properties: customize what to show
name: ZERO WIDTH NO-BREAK SPACE
old-name: BYTE ORDER MARK
general-category: Cf (Other, Format)
decomposition: (65279) ('')

As for find and replace, I think ^uxxxx is how to find unicode in MS Word. Perhaps ^u65279 is worth trying.

Regards
Dan

Samuel Murray

Netherlands
Local time: 12:02
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

Dan licked it (one solution)

Aug 20, 2015

Dan Lucas wrote:
Emacs thinks...
... decomposition: (65279) ('')
Perhaps ^u65279 is worth trying.

^u65279 in the Find field works. Thanks, Dan.

So...

To repeat this solution with other characters, one has to either use Emacs, or... reduce a copy of the file to that character only (with known characters on either side of it) and save it as UTF8 plain text, then open it in a hex editor (such as Geoffrey Prewett's 150 KB one), take note of the hex code (in my case EFBBBF), and then find the corresponding HTML entity hex code (in my case #xfeff) and HTML entity decimal code (in my case 65279). One can do this here:

http://www.google.com/search?q=site:.fileformat.info/info/unicode/char/%20efbbbf (for "efbbbf")

To type this character in MS Word, type the HTML entity hex code and press Alt+X (i.e. type FEFF and press Alt+X). To find this character in the find/replace dialog, and presumably also find it in a macro, use the HTML entity decimal code preceded by "^u" (i.e. ^u65279).
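The lookup steps above can also be done without a web search. The following is only a sketch: the function name word_codes and the dictionary keys are invented for illustration, and it deliberately handles only characters up to U+FFFF, since nothing in this thread covers how Word treats higher code points.

```python
def word_codes(utf8_hex: str) -> dict:
    """Turn UTF-8 bytes (as shown by a hex editor) into the codes Word expects."""
    char = bytes.fromhex(utf8_hex).decode("utf-8")
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    code_point = ord(char)
    if code_point > 0xFFFF:
        # Assumption: characters beyond U+FFFF are out of scope for this sketch.
        raise ValueError("only characters up to U+FFFF are handled")
    return {
        "alt_x": f"{code_point:04X}",    # type this in the document, then press Alt+X
        "find": f"^u{code_point}",       # type this in the Find field
        "vba": f"ChrW({code_point})",    # use this in a macro
    }


codes = word_codes("EFBBBF")
print(codes)
```

For EFBBBF this yields FEFF, ^u65279 and ChrW(65279), matching the values found by hand above.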

[Edited at 2015-08-21 07:42 GMT]

Stepan Konev

Russian Federation
Local time: 13:02
English to Russian

Select and Ctrl+H

Aug 20, 2015

Samuel Murray wrote:
To repeat this solution with other characters

To find such characters without a code you need to select one and press Ctrl+H in MS Word. Do not try to copy/paste it – to no avail. Just select and Ctrl+H. The Replace fields appear empty, but when you click Replace all, all such chars disappear.

I see a similar sign in some places within a text translated by qTranslate+Google. But I don't have any idea why this happens...

Samuel Murray

Netherlands
Local time: 12:02
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

@Stepan (another, simpler solution)

Aug 21, 2015

Stepan Konev wrote:
To find such characters without a code you need to select it and press Ctrl+H in MS Word. ... The Replace fields appear empty, but when you click Replace all, all such chars disappear.

Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H.

This gave me an idea, though, which works: to find the HTML entity decimal code for the mystery character, simply record a macro with it. In other words, start recording the macro, then select the character, press Ctrl+H, replace it with anything, and stop recording the macro. Then step into the macro, and you'll see the HTML entity decimal code for it: in my case, ChrW(65279).

==

By the way, those Google Translate characters in your screenshot (which I suspect are inserted by Google to help them identify machine-translated text while they crawl the web for translations) I simply remove using a macro:

Sub gt_removechars()
    ' Delete every zero-width space (U+200B, decimal 8203) in the document.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Execute FindText:=ChrW(8203), ReplaceWith:="", _
            Replace:=wdReplaceAll
    End With
End Sub
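To find out which codes such a macro should target, one option outside Word is to scan the text for invisible format characters. The sketch below is an assumption-laden illustration: the function name and the sample string are made up, and it treats every character in Unicode category Cf (which also includes the soft hyphen, for example) as a candidate.

```python
import unicodedata
from collections import Counter


def count_format_chars(text: str) -> Counter:
    """Count characters in Unicode category Cf (invisible format characters)."""
    counts = Counter()
    for char in text:
        if unicodedata.category(char) == "Cf":
            counts[(ord(char), unicodedata.name(char, "<unnamed>"))] += 1
    return counts


# Made-up sample containing two zero-width spaces and two U+FEFF characters.
sample = "a\u200bb\u200bc \ufeff>\ufeff<"
found = count_format_chars(sample)
for (code, name), n in sorted(found.items()):
    print(f"ChrW({code})  {name}  x{n}")
```

Each decimal code it reports can be dropped into the ChrW(...) call of a macro like the one above.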

Samuel

Dan Lucas

United Kingdom
Local time: 11:02
Member (2014)
Japanese to English

Hah, useful

Aug 21, 2015

Samuel Murray wrote:
Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H.

I wasn't consciously aware of that either! Since Ctrl+H is a common shortcut for find and replace (SDL Studio, Notepad++ and many more), I must have used it in Word many times without really thinking about it. Thank you to Stepan for explicitly pointing it out.

Regards
Dan

Elizabeth Joy Pitt de Morales

Local time: 12:02
Member (2007)
Spanish to English
+ ...

Thanks!

Aug 21, 2015

Stepan Konev wrote:

To find such characters without a code you need to select it and press Ctrl+H in MS Word. Do not try to copy/paste it – to no avail. Just select and Ctrl+H. The Replace fields appear empty, but when you click Replace all, all such chars disappear.

This is extremely valuable information. Thank you!
