Can someone think of a clever way of converting a VERY LARGE collection of text files into UTF-8? (General technical issues)


Thread poster: Michael Beijer

Michael Beijer

United Kingdom
Local time: 10:44
Member (2009)
Dutch to English
+ ...

Nov 16, 2010

I need them to be UTF-8 so I can align them and not end up with gibberish for various Dutch characters.

Thanks,

Michael

Soonthon LUPKITARO(Ph.D.)

Thailand
Local time: 16:44
English to Thai
+ ...

Batch processing

Nov 16, 2010

If you are doing translation alignment, Trados and many other CAT tools have batch processing steps in which you can select UTF-8 as the output format. Alternatively, you can convert your text files to UTF-8 by running them through the tool and saving the "translation" without actually translating anything. MS Word can also batch-process files with its 'Save As' function.

Soonthon Lupkitaro

fidaa2007
Egypt
Local time: 11:44
English to Arabic
+ ...

Use uni2me10

Nov 16, 2010

Hi,

I am currently working on a transcription project that requires a massive number of txt files to be in UTF-8. I use uni2me10 for this; it is very easy to use. If you don't have it, I can email it to you.

Best,
Fidaa

FarkasAndras

Local time: 11:44
English to Hungarian
+ ...

iconv

Nov 16, 2010

The "proper" solution is iconv. It's originally a Unix tool, but there's a Windows port. The trouble is that it has no built-in batch mode; the assumption is that Unix users can work out the necessary shell command themselves.
It would be fairly easy to write a .bat or a simple Perl script that loops an iconv command to convert all the files in a folder, perhaps I'll wh...
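
The loop described above can be sketched in a few lines. This is a minimal illustration in Python using its built-in codecs rather than iconv, and it is not the .bat mentioned in this thread. The function name `convert_folder`, the `*.txt` pattern and the default source encoding `cp1252` are assumptions for the example; the source encoding in particular has to be supplied by whoever knows the files.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, source_encoding="cp1252"):
    """Re-encode every .txt file in src_dir as UTF-8 (no BOM) into dst_dir.

    source_encoding is an assumption the caller must get right; it is
    not detected automatically.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(src.glob("*.txt")):
        # Work on raw bytes so line endings are left exactly as they were.
        # Strict decoding raises on bytes that are invalid in source_encoding.
        text = path.read_bytes().decode(source_encoding)
        (dst / path.name).write_bytes(text.encode("utf-8"))
        converted.append(path.name)
    return converted
```

Writing to a separate output folder keeps the originals untouched, which matters if the source encoding turns out to have been guessed wrongly.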

FarkasAndras

Local time: 11:44
English to Hungarian
+ ...

Batch tool

Nov 16, 2010

I decided this was so simple I had to just go ahead and do it. Enjoy.

Note: I wouldn't use tools like uni2me10. I just googled it and the first hit is this page on proz... not a sign of a widely used and recognized tool. How do you know it works as advertised and doesn't install a trojan on your system?
Iconv is open source, so it is guaranteed to contain no malicious code, and it's part of the GNU project, so it's guaranteed to be pretty good - probably as good as it gets. I can't think of a reason to use anything else. Of course you should always get your tools from a trusted source, which in this case would be http://gnuwin32.sourceforge.net/packages/libiconv.htm if you're on Windows and, well, if you're not, it probably came preinstalled on your system.

[Edited at 2010-11-16 10:00 GMT]

Samuel Murray

Netherlands
Local time: 11:44
Member (2006)
English to Afrikaans
+ ...

What format are they currently in?

Nov 16, 2010

Michael J.W. Beijer wrote:
I need them to be UTF-8 so I can align them and not end up with gibberish for various Dutch characters.

What format are they currently in? Are the Dutch characters not gibberish in the existing format? Why would aligning them cause the Dutch characters to turn into gibberish?

Samuel Murray

Netherlands
Local time: 11:44
Member (2006)
English to Afrikaans
+ ...

On sweeping statements

Nov 16, 2010

FarkasAndras wrote:
I wouldn't use tools like uni2me10. I just googled it and the first hit is this page on proz... not a sign of a widely used and recognized tool. How do you know it works as advertised and doesn't install a trojan on your system?

I also googled for it, and I found that it is abandonware, but still available:
http://web.archive.org/web/*/http://alf-li.pcdiscuss.com/files/uni2me10.zip

It is free for single personal use (whatever that means). I haven't tested it comprehensively but it seems pretty straight-forward and can convert many different formats. The author's home page is here: http://web.archive.org/web/20080730002928/http://alf-li.pcdiscuss.com/

Iconv is open source, so it is guaranteed to contain no malicious code...

There is nothing in open source software that prevents maliciousness.

...and it's part of the GNU project so it's guaranteed to be pretty good - probably as good as it gets.

Some GNU programs are pretty good, and some of them are quite bad.

[Edited at 2010-11-16 13:25 GMT]

Michael Beijer

United Kingdom
Local time: 10:44
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

@Samuel

Nov 16, 2010

They are EUconst text files, such as this one: http://beijer.mx/storage/ep-00-10-05_nl.txt (however, that one has been re-saved in UltraEdit as UTF-8)

What was happening was that when I imported them into memoQ LiveDocs, they would have garbled characters here and there where there were Dutch characters. However, I have since realised that memoQ has a setting to import and use UTF-8 internally, so the problem seems to have been solved. I am still playing around with FarkasAndras' cool little .bat, though.

Incidentally, FarkasAndras: how about building that into LF Aligner as well, or better yet, using it to convert all import and export (automatically) in LF Aligner into UTF-8?

Michael

FarkasAndras

Local time: 11:44
English to Hungarian
+ ...

Messy

Nov 16, 2010

Michael J.W. Beijer wrote:

They are EUconst text files, such as this one: http://beijer.mx/storage/ep-00-10-05_nl.txt (however, that one has been re-saved in UltraEdit as UTF-8)

What was happening was that when I imported them into memoQ LiveDocs, they would have garbled characters here and there where there were Dutch characters. However, I have since realised that memoQ has a setting to import and use UTF-8 internally, so the problem seems to have been solved. I am still playing around with FarkasAndras' cool little .bat, though.

Incidentally, FarkasAndras: how about building that into LF Aligner as well, or better yet, using it to convert all import and export (automatically) in LF Aligner into UTF-8?

Michael

I'd love to be able to do that, but file encoding is a huge mess so it's not that simple. Basically, there is no easy and reliable way of telling what encoding a given text file is in. There is no metadata ("this file is in Latin-2 encoding with Unix line breaks") attached to text files. As you can see, even MemoQ, a reasonably good piece of commercial software, fails to handle files correctly automatically.
When you open a file with a text editor, it basically takes an educated guess at the encoding based on what the file looks like, and I don't feel like replicating that in the aligner. I could integrate the functionality of this .bat, i.e. ask the user to specify the input encoding and reencode the files based on that - but in many ways, that would be worse than the current system of telling the user to open the file in a text editor and resave it in UTF-8. It would be more error-prone and it would require a bit more computer literacy from the user.
Of course this functionality could be very useful in the batch aligner... maybe, maybe. Until then, there's the bat for batch conversions and text editors for one-offs.
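
To make the "educated guess" concrete, here is a rough sketch in Python of the kind of heuristic a text editor might apply. It is an assumption-laden illustration, not how any particular editor, memoQ or LF Aligner works: the function name `guess_encoding` and the `cp1252` fallback are invented for the example, and the guess can be wrong, which is exactly the problem described above.

```python
import codecs


def guess_encoding(data, fallback="cp1252"):
    """Return a best-guess codec name for the raw bytes of a text file."""
    # A BOM is the only explicit hint a plain text file can carry.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # Accented text in a legacy 8-bit encoding is rarely valid UTF-8 by
    # accident, so a clean strict decode is decent evidence for UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Nothing in the bytes says which 8-bit code page this is;
        # the fallback is purely the caller's assumption.
        return fallback
```

Note that the last step cannot tell Latin-1 from Latin-2 or any other 8-bit code page, so the user still has to be asked in that case.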

FarkasAndras

Local time: 11:44
English to Hungarian
+ ...

?

Nov 16, 2010

Samuel Murray wrote:
There is nothing in open source software that prevents maliciousness.

What? Did you even mean that? In my part of the universe, things like the entire source code being available for anyone to see make it pretty safe to assume there's no malicious code in there. Writing open-source malware would be an amazingly dumb move - if the project has any visibility at all, it will be found out nearly instantly.

I would propose that iconv, the character converter used by default by pretty much every linux distro and OS X probably works better and is safer than some random character converter found online... but whatever floats your boat.

Of course if you don't compile from source yourself, you only get a black box as usual... but if you download from a safe source and/or compare md5 checksums, you're guaranteed to get what you think you're getting.
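
For anyone who has not compared checksums before, a minimal sketch in Python follows. The helper names `file_digest` and `matches_published` are made up for this example. A match shows only that the downloaded file is identical to the one the published value was computed from; it says nothing by itself about what the program does.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex, algorithm="md5"):
    """Compare a file's digest with the value published on the download page."""
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

Passing `algorithm="sha256"` works the same way where the download page publishes a SHA-256 value instead.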

Samuel Murray

Netherlands
Local time: 11:44
Member (2006)
English to Afrikaans
+ ...

On maliciousness and opensource (veering off-topic here)

Nov 16, 2010

FarkasAndras wrote:

Samuel Murray wrote:
There is nothing in open source software that prevents maliciousness.

1. Writing open-source malware would be an amazingly dumb move - if the project has any visibility at all, it will be found out nearly instantly.

Activists of open source would like us to believe that whenever anyone releases software as open source, scores of programmers will download and check the source code to verify it and to contribute code to improve the program. Quite a few closed source projects (useful and mature products) had gone open source and then gone closed source again a year or two later, due to lack of interest. Not all open source projects generate interest.

It is also a myth that anyone who knows a certain programming language can read the source code of a program and within a very short period of time figure out what the program does, what its weak points are and whether the program does anything unintended or unwanted. That's like thinking that a translator can open a pocket dictionary and know after just a short investigation which of the words in it are misspelt or which of the definitions in it are incorrect.

Furthermore, a maliciously malicious programmer would hide his malicious code so that it doesn't appear malicious unless you look very closely or are very well versed in that programming language.

2. I would propose that iconv, the character converter used by default by pretty much every linux distro and OS X probably works better and is safer than some random character converter found online.

True, but that doesn't mean the random converter is necessarily unsafe. And if the random converter does the job better, why not use it?

3. Of course if you don't compile from source yourself, you only get a black box as usual... but if you download from a safe source and/or compare md5 checksums.

There is no way of knowing whether the compiled version of a file you download from an open source site is the compiled version of the accompanying source code, unless you re-compile the source code just to check. A malicious developer can put a trojan in the compiled version and no-one will be the wiser unless someone actually checks up on him or reports the trojan activity to some online forum.

Similarly, the md5 checksum only tells you that the file you downloaded is the same file as the one that is mentioned on the page where the md5 checksum is mentioned, but it doesn't give any guarantees about the safety of the program unless you absolutely trust the web site that provides those md5 numbers for you.

Don't be fooled into thinking that an open source product is necessarily safer than a closed source one. Even if the developer is not malicious, his software might cause damage to your system without intention to do so, and the open source nature of the software will do nothing to prevent that.

As for iconv, I've tried the program in the past and found it unreliable. For example, if you convert a UTF8Y file (UTF8 with BOM) to UTF16LE, then everything is fine, but if you convert a UTF8N file (UTF8 without BOM) to UTF16LE, the UTF16LE file will be BOM-less, and this will cause some text editors to read it as ANSI. And if you try to convert a UTF8N file to Latin-1, you'll succeed, but if you try to convert a UTF8Y file to Latin-1, iconv tells you that it can't do the conversion.
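
One way around the BOM-less UTF16LE output is to handle the BOM explicitly in the conversion step. The sketch below uses Python's built-in codecs rather than iconv, and the function name `utf8_to_utf16le_with_bom` is an assumption for illustration. It accepts UTF-8 input with or without a BOM and always writes a UTF-16LE BOM.

```python
import codecs


def utf8_to_utf16le_with_bom(data):
    """Convert UTF-8 bytes (BOM optional) to UTF-16LE bytes with a BOM."""
    # "utf-8-sig" strips a leading UTF-8 BOM if present and accepts its absence.
    text = data.decode("utf-8-sig")
    # "utf-16-le" never writes a BOM on its own, so prepend one explicitly.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

Both kinds of input then produce byte-identical output, so the result no longer depends on whether the source file happened to carry a BOM.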

Michael Beijer

United Kingdom
Local time: 10:44
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

"The output UTF8 files it generates don't have a BOM and it doesn't accept input UTF8 files w/ BOM"

Nov 16, 2010

Samuel Murray wrote:
As for iconv, I've tried the program in the past and found it unreliable. For example, if you convert a UTF8Y file (UTF8 with BOM) to UTF16LE, then everything is fine, but if you convert a UTF8N file (UTF8 without BOM) to UTF16LE, the UTF16LE file will be BOM-less, and this will cause some text editors to read it as ANSI. And if you try to convert a UTF8N file to Latin-1, you'll succeed, but if you try to convert a UTF8Y file to Latin-1, iconv tells you that it can't do the conversion.

As far as I can tell it shouldn't be used on files with a BOM.

"The output UTF-8 files it generates don't have a BOM and it doesn't accept input UTF-8 files with a BOM." (from the info file in FarkasAndras' iconv_batch)

Michael

Samuel Murray

Netherlands
Local time: 11:44
Member (2006)
English to Afrikaans
+ ...

@Michael

Nov 16, 2010

Michael J.W. Beijer wrote:
As far as I can tell [iconv] shouldn't be used on files with a BOM.

"The output UTF-8 files it generates don't have a BOM and it doesn't accept input UTF-8 files with a BOM." (from the info file in FarkasAndras' iconv_batch)

No, I think you're confusing two things here. The reason for Farkas' comment about UTF8 BOMs has to do with UTF8 itself (some purists believe that UTF8 files should not have BOMs, ever). The UTF8 to UTF16LE conversion problem I'm having is not related to the usual UTF8 BOM issue. UTF16LE files usually do have BOMs (even UTF8 purists generally acknowledge that).

I can only guess but I don't think iconv has a problem with UTF8 BOMs -- the problem is that iconv doesn't add a BOM to a target file if the source file didn't have a BOM (even if the BOMs are different, and even if the target file typically requires a BOM). This would mean that iconv is *very* primitive.

Even if these issues were related, it means that you'd have to pump the files through another converter first (one that removes the BOM) before you could pump them through iconv.

FarkasAndras

Local time: 11:44
English to Hungarian
+ ...

Security

Nov 16, 2010

Samuel Murray wrote:
There is no way of knowing whether the compiled version of a file you download from an open source site is the compiled version of the accompanying source code, unless you re-compile the source code just to check.

True enough, nothing is truly 100% in this life. But if you're getting something from a respectable site like sourceforge.net, and that something happens to be part of a major open source project, that's as close to a 100% guarantee of safety as it gets - not even stuff you get from IBM or Kaspersky is truly 100% secure.
If you get something from an unknown source, the odds are a lot worse than with any of the major open source sites.

Samuel Murray wrote:

Don't be fooled into thinking that an open source product is necessarily safer than a closed source one.

I would say that it is, given the same scenario. If you have 10 minutes to devote to finding a small utility like a character converter or a media transcoder for no money, you are far better off going for an open source solution. Get it from sourceforge.net or the original developer's site as linked from Wikipedia, and you can be 99.99% sure it's kosher. If you want the same level of trustworthiness from closed source software, you'll usually have to spend a hell of a lot more time researching the publisher, or you'll have to pay.

Can't comment on the specifics, I only have limited experience with iconv. I'm not sure if UTF-16LE files are required to have a BOM or not, the problem may be with the text editor.
Iconv does seem to fail on UTF8Y -> Latin-1, which is an annoying bug, but kind of understandable. In the *nix world, UTF-8 files with a BOM are considered malformed and no *nix program I know produces UTF8Y... it was probably never designed to handle files like that. Still, it should know better.
My guess is that it doesn't expect a BOM in UTF-8 files, so it handles it as the first character(s) of the file, not a BOM - which causes the Y/N BOM alternation in output files, and the failure with Latin-1 (I'm guessing FEFF is not valid in Latin-1 and it finds the malformed characters in a post-conversion check and reports the error).
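
The guess above can be illustrated with Python's codecs, which behave the way described when the BOM is treated as an ordinary first character. This says nothing certain about iconv's internals; the function name `to_latin1` and its `strip_bom` switch are assumptions made for the example.

```python
def to_latin1(data, strip_bom=True):
    """Convert UTF-8 bytes to Latin-1 bytes.

    With strip_bom=False a leading BOM is decoded as the character U+FEFF,
    which has no Latin-1 equivalent, so encoding raises UnicodeEncodeError.
    """
    codec = "utf-8-sig" if strip_bom else "utf-8"
    return data.decode(codec).encode("latin-1")
```

With the default `strip_bom=True`, files with and without a BOM convert identically, so no separate BOM-removal pass is needed before the conversion.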

Samuel Murray

Netherlands
Local time: 11:44
Member (2006)
English to Afrikaans
+ ...

More myths (still off-topic)

Nov 16, 2010

FarkasAndras wrote:
True enough, nothing is truly 100% in this life. But if you're getting something from a respectable site like sourceforge.net...

Sourceforge.net is just a host. Saying something is more likely to be good because you got it from Sourceforge.net is like saying a translator is more likely to be good because you found him via ProZ.com. Neither ProZ.com nor Sourceforge.net has any control over the quality of the items (translators or software) listed there (though both have mechanisms and procedures in place to get rid of malicious items... as and when they are discovered). Sourceforge.net has no way of checking the nature or integrity of the compiled programs hosted on it.
