Can someone think of a clever way of converting a VERY LARGE collection of text files into UTF-8? (General technical issues)

Thread poster: Michael Beijer

Michael Beijer

United Kingdom
Local time: 08:43
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

Hmm

Nov 16, 2010

Well, let me speak from my own experience here.

I have downloaded and installed a very large number of programs from the internet in my life. Of these, some were shareware, and/or donationware, some were payware, and some were simply free, or open source, etc.

Now, I can definitely say that the number of virus/adware/spyware problems I had to fix on my computer as a direct result of installing programs THAT DID NOT COME FROM THE OPEN SOURCE CAMP was MUCH, MUCH greater than the number caused by open source programs. I would be very surprised if anyone reported a different experience.

Let's be honest, how many times has an open source app from Sourceforge *&^#ed up your PC? And how many times has a program like the one under discussion done the same?

Programs that are very old, or that are offered as "free" but aren't really, generally cause more problems than Real Open Source software.

And I know that you will say: "yes but ...", and, "but how about ...", etc. However, speaking from simple everyday experience, the dangers involved in installing some esoteric little *nix app from Sourceforge, in no way equal those involved in surfing the big popular "Freeware" sites in search of a program to quickly solve a particular problem.

But then again, I might be wrong;)

Michael

Kevin Lossner

Portugal
Local time: 08:43
German to English
+ ...

memoQ?

Nov 16, 2010

Michael J.W. Beijer wrote:

I need them to be UTF-8 so I can align them and not end up with gibberish for various Dutch characters.

I noticed on your profile that you are a memoQ user. Why not simply use the LiveDocs feature in the latest version (4.5) and specify the codepage so it doesn't trash the characters? That will also save you the alignment time for the most part, though I've found some manual intervention to be necessary at times. Sometimes fiddling with options like "structural" helps.

I suppose you could also set up a project with the text files, specify UTF-8 output and simply copy source to target. If all you want is an alignment, concatenating the files might also save you some time in any of these procedures.

Samuel Murray

Netherlands
Local time: 09:43
Member (2006)
English to Afrikaans
+ ...

Good points

Nov 16, 2010

Michael J.W. Beijer wrote:
Let's be honest, how many times has an open source app from Sourceforge *&^#ed up your PC? And how many times has a program like the one under discussion done the same?

Agreed. Open source programs from dubious sources typically don't break your computer. The worst case is that they simply don't work (this often happens to me with open source software).

Closed source freeware is more likely to break something on your computer or install a trojan or suchlike, but one can minimise those experiences by just being careful and not being stupid (e.g. check what the developer's web site looks like, whether the tool is mentioned in forums, etc, and try to get a version of the program that is closest to what one might consider a reliable source).

If something looks particularly dubious but you want to try it anyway, you can use a sandbox tool or install it temporarily on a virtual machine. A good firewall and antivirus also help.

The tool in question, uni2me 1.0, doesn't seem to break anything, and it works in batch, so it's a perfect fit for what was requested. If iconv works for you, then great. But what I dislike are sweeping statements about how glorious open source is and how dangerous closed source is.
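For anyone who would rather script the conversion than install a tool, here is a minimal Python sketch of a batch re-encode. It is not how uni2me or iconv work internally. The function name, the `*.txt` pattern and above all the source encoding (`cp1252`) are assumptions for illustration; the thread never states which code page the original files use, so check that first.

```python
from pathlib import Path


def convert_to_utf8(src_dir, dst_dir, source_encoding="cp1252", pattern="*.txt"):
    """Write UTF-8 copies of every matching file in src_dir into dst_dir.

    source_encoding is an assumption: set it to the real code page of the files.
    Returns two lists of file names: (converted, failed).
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for src in sorted(src_dir.glob(pattern)):
        try:
            # Decode strictly so that a wrong encoding guess is reported
            # instead of silently producing gibberish.
            text = src.read_bytes().decode(source_encoding)
        except UnicodeDecodeError:
            failed.append(src.name)
            continue
        # Bytes in, bytes out: line endings are left exactly as they were.
        (dst_dir / src.name).write_bytes(text.encode("utf-8"))
        converted.append(src.name)
    return converted, failed
```

Calling `convert_to_utf8("corpus", "corpus_utf8")` (both folder names hypothetical) leaves the originals untouched and returns the names of any files that could not be decoded, so they can be inspected by hand.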

Michael Beijer

United Kingdom
Local time: 08:43
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

LiveAlign™

Nov 16, 2010

Hi KSL Berlin,

Yes, I am indeed, as we speak, working on importing them all into a LiveDocs corpus, and things are going quite well. As you suggested, I am fiddling around with the "structural" setting, as well as trying to create the best import template for this specific type of EUconst file. They're text file alignments from the OPUS project (http://urd.let.rug.nl/tiedeman/OPUS/).

I also noticed today that I could specify the code page in there as well, which has solved the UTF-8 problem.

Basically, what I am trying to do is decide on a Best Strategy for dealing with large amounts of bilingual data, which I will then stick with. (Well, that's the plan anyway). I don't want to start aligning thousands of files, only to later discover that there was a much better way to do it;)

So far, I have found that AlignFactory Light does the best job in terms of alignment. However, it has a maximum of 100 files per batch, and my current batch counts over 5,000. I have also tried FarkasAndras' LF Aligner (http://sourceforge.net/projects/aligner/), but have been having some problems with that.
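One way around a 100-files-per-batch limit is to split the folder into numbered subfolders first and feed them to the aligner one at a time. A rough Python sketch, assuming the files match `*.txt` and that each batch can be processed independently; the function and folder names are made up for illustration, and it does not try to keep source/target pairs together.

```python
from pathlib import Path
import shutil


def split_into_batches(src_dir, dst_dir, batch_size=100, pattern="*.txt"):
    """Copy files into numbered subfolders of at most batch_size files each.

    Returns the list of batch folders that were created, in order.
    """
    files = sorted(Path(src_dir).glob(pattern))
    dst_dir = Path(dst_dir)
    batch_dirs = []
    for start in range(0, len(files), batch_size):
        # batch_0001, batch_0002, ... (hypothetical naming scheme)
        batch_dir = dst_dir / f"batch_{start // batch_size + 1:04d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for path in files[start:start + batch_size]:
            shutil.copy2(path, batch_dir / path.name)
        batch_dirs.append(batch_dir)
    return batch_dirs
```

With 5,000 files and the default batch size this would produce 50 folders. If the two language versions of a document have to land in the same batch, the file list would need to be grouped by pair before slicing.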

I am now planning on doing it all in memoQ, with the new ("game-changing") LiveDocs™/LiveAlign™ feature. One of the coolest things about this method (if it works) is that I don't need to worry about the alignments being perfect or not, because I can simply right-click on the term in question and its context will pop up. I've had quite a few crashes so far though, mostly due to something to do with the relation between corpora and the translation editor, but I seem to be making progress there.

In any case, if the new Live stuff in memoQ 4.5 works as they say it does ... this might actually change the way people use their TMs from now on. We shall see.

Michael

p.s. I noticed the following in the list of available filters: "Typo3 : XML filter for Typo3 CMS pages. Contributed by Carsten Peters." This sounds very interesting. I suppose new filters might start appearing in the memoQ mailing list soon.

p.p.s. I tried concatenating files (with copy /a *.log aggregate.txt), but when they get really big, all of the programs start having problems. AlignFactory, LF Aligner, and memoQ.
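A possible middle ground between thousands of small files and one huge aggregate is to concatenate into several medium-sized files. The sketch below does that in Python. The size cap is an arbitrary assumption (nothing here says at what size the programs start struggling), the names are hypothetical, and appending a plain newline byte assumes an ASCII-compatible encoding such as cp1252 or UTF-8.

```python
from pathlib import Path


def concatenate_capped(src_dir, dst_dir, max_bytes=50_000_000, pattern="*.txt"):
    """Concatenate files into aggregate_NNN.txt files of about max_bytes each.

    A single input file is never split, so an aggregate can exceed the cap
    if one input file is larger than max_bytes on its own.
    """
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    out = None
    written = 0
    for src in sorted(Path(src_dir).glob(pattern)):
        data = src.read_bytes()
        if not data:
            continue  # skip empty files
        # Start a new aggregate when this file would push the current one over the cap.
        if out is None or (written > 0 and written + len(data) > max_bytes):
            if out is not None:
                out.close()
            path = dst_dir / f"aggregate_{len(outputs) + 1:03d}.txt"
            out = path.open("wb")
            outputs.append(path)
            written = 0
        out.write(data)
        written += len(data)
        # Make sure the next file starts on a fresh line.
        if not data.endswith(b"\n"):
            out.write(b"\n")
            written += 1
    if out is not None:
        out.close()
    return outputs
```

Lowering `max_bytes` until the aligner stops choking is probably the quickest way to find a workable size.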

[Edited at 2010-11-16 22:11 GMT]

[Edited at 2010-11-16 22:32 GMT]

Michael Beijer

United Kingdom
Local time: 08:43
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

@Samuel

Nov 16, 2010

Samuel Murray wrote:

The tool in question, uni2me 1.0, doesn't seem to break anything, and it works in batch, so it's a perfect fit for what was requested. If iconv works for you, then great. But what I dislike are sweeping statements about how glorious open source is and how dangerous closed source is.

Yes, I agree with you on that point. Both have their strong points and their weak points. It is definitely true that with a lot of open source software, you spend most of your time trying to make it work. That's why I gave up on Linux for my real work, and now just stick with good old XP Pro.

Michael

FarkasAndras

Local time: 09:43
English to Hungarian
+ ...

LF Aligner

Nov 16, 2010

Michael J.W. Beijer wrote:

So far, I have found that AlignFactory Light does the best job in terms of alignment. However, it has a maximum of 100 files per batch, and my current batch counts over 5,000. I have also tried FarkasAndras' LF Aligner (http://sourceforge.net/projects/aligner/), but have been having some problems with that.

Totally off topic here, but feel free to post them in the LF aligner thread.

Michael Beijer

United Kingdom
Local time: 08:43
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

@FarkasAndras

Nov 16, 2010

Yup.

I'll make sure to tell you what is and isn't working. I'm actually having quite a bit of trouble at the moment, trying to import my folder of 6,000 text files into memoQ for alignment, and am starting to think that none of the programs are going to be able to deal with that number of files.

Hmm.

Michael

FarkasAndras

Local time: 09:43
English to Hungarian
+ ...

Industrial tools for industrial problems

Nov 16, 2010

Michael J.W. Beijer wrote:

Yup.

I'll make sure to tell you what is and isn't working. I'm actually having quite a bit of trouble at the moment, trying to import my folder of 6,000 text files into memoQ for alignment, and am starting to think that none of the programs are going to be able to deal with that number of files.

Hmm.

Michael

I see no reason why LF aligner wouldn't, with the right settings. Perl itself should certainly be capable of that, and one of my main goals with the aligner was robustness and handling large files. The files are never loaded into memory in full etc. 6000 is definitely pretty extreme, but it does them one by one, so it should work in principle. Anyway, I'm curious about performance with that sort of load too, so go right ahead and try - and post your results in the other thread. I'd try a dozen files first, work out the right setup, then try a couple of hundred, then all 6000.

[Edited at 2010-11-16 23:48 GMT]

Kevin Lossner

Portugal
Local time: 08:43
German to English
+ ...

Communicate with support

Nov 20, 2010

Michael J.W. Beijer wrote:
I am now planning on doing it all in memoQ, with the new ("game-changing") LiveDocs™/LiveAlign™ feature...
p.p.s. I tried concatenating files (with copy /a *.log aggregate.txt), but when they get really big, all of the programs start having problems. AlignFactory, LF Aligner, and memoQ.

I had a few issues with large Java properties files and LiveDocs (mostly my ignorance as I was learning about the feature), and I found it very helpful to communicate with Kilgray support on the matters. Given that this is a new feature undergoing continued improvement, they might find your project interesting and useful to push the limits of the tool.

Keep us posted, please. This feature interests me a great deal as well, and at this stage reports are very useful.

FarkasAndras

Local time: 09:43
English to Hungarian
+ ...

readme

Nov 20, 2010

Michael J.W. Beijer wrote:
I tried concatenating files, but when they get really big, all of the programs start having problems. AlignFactory, LF Aligner, and memoQ.

LF Aligner was made specifically for this sort of use. I have done files as large as 400,000 segments and I've never had it fail due to file size. Just open the settings file and set the "Chop up files larger than this size" property to 15000 or so to put it in "large file mode" and you're good to go. What it does is chop up the files, align them in bits and then concatenate the output.
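The chop, process and concatenate idea can be sketched roughly as follows. This is Python for illustration only, not LF Aligner's actual code. `process_chunk` is a hypothetical stand-in for whatever is done to each piece, and treating the 15000 figure as a number of lines is an assumption, since the unit of the setting is not stated here.

```python
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(in_path, out_path, process_chunk, chunk_size=15000):
    """Process in_path piece by piece and write the results to out_path.

    Only one chunk is held in memory at a time. process_chunk (hypothetical)
    takes a list of lines and returns a list of lines.
    Returns the number of chunks processed.
    """
    chunks_done = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for chunk in iter_chunks(src, chunk_size):
            dst.writelines(process_chunk(chunk))
            chunks_done += 1
    return chunks_done
```

Because the input is read lazily and the output is appended chunk by chunk, memory use depends on the chunk size rather than on the size of the whole file.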

BTW, doesn't the OPUS site offer aligned files for download? Or are you not happy with their alignments?

Tom Hoar (X)
United States
Local time: 03:43
English

CorpusFiltergraph will do it

Dec 8, 2010

CorpusFiltergraph can do the job.

You said "large"... how big is that? Last year, we used it to extract/align over 7.5 million XML files into SMT training corpora. Documentation needs a lot of work!

http://sourceforge.net/projects/corpfiltergraph/

Tom
