Pages in topic:   [1 2] >
How do you speed up your term/phrase search process (for TM, glossary, termbases)?
Thread poster: Alex Aruj
Alex Aruj
Alex Aruj  Identity Verified
United States
Local time: 14:27
Spanish to English
+ ...
Oct 19, 2014

I'm curious: who among you has longed for better and faster access to the translation memories, glossaries and documents you have created, downloaded or otherwise acquired, whether on your machine or in the cloud?

After running into speed problems opening and searching massive DGT translation memories in Trados (0.5 GB per TM), I went looking for tools that can search TMX files with faster, more accessible algorithms and without crashing my CAT tool. Note: I also had issues with Trados refusing to restore from minimized view when it was left in the Translation Memory pane, which only added to my frustration with the term search process.

There are platforms based on Apache Lucene: http://lucene.apache.org/core/
Lucene explained on Wikipedia: http://en.wikipedia.org/wiki/Lucene
Lucene runs in the background of many search engines and gives anyone full-text search across text in almost any container (Word files, PDF, HTML, etc.). That means the ability to search across many files at once, with any number of search parameters according to user taste (fuzzy search, exact search, stemmed words, "words similar to", etc.), on an interface that runs independently of CAT tools, of course.
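To make the idea concrete, here is a toy inverted index in Python. This is only a sketch of the core data structure behind Lucene-style engines, not Lucene's actual API; the file names and sample text are made up for illustration.

```python
from collections import defaultdict
import re

def build_index(docs):
    """Map each lowercased word to the set of doc names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

def search(index, *words):
    """Return docs containing ALL the given words (an AND query)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

docs = {
    "memo.txt": "The invoice total is due on delivery",
    "glossary.txt": "invoice: factura; delivery: entrega",
}
index = build_index(docs)
hits = search(index, "invoice", "delivery")  # both files contain both words
```

Because lookups go through the word-to-files map instead of re-reading every file, queries stay fast no matter how many documents are indexed; real engines add stemming, fuzzy matching and ranking on top of the same basic structure.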

I haven't made it there yet to index my translation memories, glossaries, Word files, and pdfs, but I want to spread the word and inspire new possibilities and paradigms and ask:

What is the size of your resource database or frequently searched TMs in GB?
How do you access these files? Would you like to access them all at once?
Are you frustrated with the search speed of your computer/CAT tool search application e.g. Trados TM search?

OR

Are these questions of size, speed and access irrelevant since you are using online search portals?



[Edited at 2014-10-19 18:18 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:27
Member (2009)
Dutch to English
+ ...
Hi Alex, Oct 19, 2014

I highly recommend: TMLookup.

I have tried every program under the sun for searching very large numbers of TMXs, and TMLookup wins hands down. My current db contains 45,000,000 TUs (!!!), and most searches complete in seconds (average search times are around 0-3 seconds). The developer of LF Aligner made it, and it is free:


***** http://www.farkastranslations.com/tmlookup.php *****

Another option, which is still new and a little rough around the edges (Igor only made it a few days ago), is CafeTran's latest cool new feature called ‘Total Recall’. Have a look:

Hans’s wiki: http://cafetran.wikidot.com/total-recall
Recent post here on Proz: http://www.proz.com/forum/cafetran_support/276038-amazing_new_feature_total_recall_is_back.html
The mailing list: https://groups.google.com/forum/#!topic/cafetranslators/gTQtifk_vOU

It basically allows you to index massive amounts of TMX data and use it to pre-translate your text (it's also connected to CT's auto-assembly system).
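For the curious, extracting translation units from a TMX file is straightforward, since TMX is plain XML. Below is a minimal Python sketch using only the standard library; it is not how Total Recall or TMLookup actually import data, and real TMX files carry more metadata (and region-tagged language codes like en-US, which this toy version ignores).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under its full namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_tus(tmx_text, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")  # older TMX used lang=
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

sample = """<tmx version="1.4"><body>
  <tu><tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="nl"><seg>Hallo wereld</seg></tuv></tu>
</body></tmx>"""
pairs = list(extract_tus(sample, "en", "nl"))
```

Once the TUs are out of the XML, "indexing" them is just a matter of feeding the pairs into whatever database or search engine you prefer.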

Michael


 
Danesh
Danesh
Local time: 00:57
English to Persian (Farsi)
TMLookup or CafeTran's 'Total Recall' search functionalities Oct 20, 2014

Michael,
Can TMLookup or CafeTran's 'Total Recall' search and give access to text in other containers (.pdf, html, docx, etc) besides TMXs as Lucene can?


 
Rolf Keller
Rolf Keller
Germany
Local time: 23:27
English to German
Fast searching via Multifultor Oct 20, 2014

If you want fast searching, you need a tool that can search several resources (online & offline) simultaneously.

Use the URL in my profile and download multifultor.zip or its Readme.pdf.


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:27
Member (2009)
Dutch to English
+ ...
@Danesh: Oct 20, 2014

Danesh wrote:

Michael,
Can TMLookup or CafeTran's 'Total Recall' search and give access to text in other containers (.pdf, html, docx, etc) besides TMXs as Lucene can?


Nope, just TMXs. They are both designed to work with bilingual translation memories.

If you need a desktop search program, I'd suggest dtSearch.

Michael


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:27
Member (2009)
Dutch to English
+ ...
In answer to your 3 questions… Oct 20, 2014

Alex Aruj wrote:

I haven't made it there yet to index my translation memories, glossaries, Word files, and pdfs, but I want to spread the word and inspire new possibilities and paradigms and ask:

1. What is the size of your resource database or frequently searched TMs in GB?
2. How do you access these files? Would you like to access them all at once?
3. Are you frustrated with the search speed of your computer/CAT tool search application e.g. Trados TM search?

OR

Are these questions of size, speed and access irrelevant since you are using online search portals?


1. Very, very large. I have hundreds of TMXs scattered across my computers, but the basic selection I like to have accessible for concordance searching is around 45,000,000 TUs. I can't remember how many actual TMXs I imported into my TMLookup db to attain this (10-20 GB worth maybe; just guessing). My TMLookup default.db is currently 25 GB.

2. In various ways. I access some as TMXs attached to my CafeTran project, the big collection mentioned above via TMLookup, and some pop up when searching in dtSearch (my dtSearch index is currently 120 GB!), etc. I also sometimes search in individual TMXs, which I open in Heartsome's TMX Editor, Olifant, EmEditor (a text editor) or Xbench.

3. Not really. However, I would be very interested in a potential Lucene-based solution, especially one that could index and search multiple document types (not just TMX or .txt). I played with Lucene a while ago but didn't get very far.

Michael

PS: as far as I know, TMLookup uses an SQLite db and CT's Total Recall uses an H2 (http://www.h2database.com/html/main.html) db.


[Edited at 2014-10-20 09:55 GMT]


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 23:27
English to Hungarian
+ ...
TMLookup Oct 20, 2014



PS: as far as I know, TMLookup uses an SQLite db

Yes, it uses the FTS (full-text search) functionality of SQLite. I looked into Lucene as a potential search engine when I started working on TMLookup, but it didn't work out, so I just went with SQLite. (IIRC it would have required too much setup to get it to work, and it would have been impossible to package it with TMLookup.)
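To illustrate the approach, here is a minimal sketch of SQLite FTS from Python. This is my own toy schema, not TMLookup's actual one, and it assumes your SQLite build ships the FTS4 extension (most do). Note the column-restricted query, which is exactly the "source contains X and target contains Y" style of search that specialized TM tools offer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS4 virtual table with one column per language; every column is indexed.
conn.execute("CREATE VIRTUAL TABLE tm USING fts4(source, target)")
conn.executemany(
    "INSERT INTO tm (source, target) VALUES (?, ?)",
    [
        ("The invoice is overdue", "De factuur is achterstallig"),
        ("Delivery is scheduled for Monday", "De levering staat gepland voor maandag"),
    ],
)

# Column-restricted query: match 'invoice' in the source column only.
rows = conn.execute(
    "SELECT source, target FROM tm WHERE tm MATCH 'source:invoice'"
).fetchall()

# Terms are ANDed by default, so this finds TUs where the source
# contains 'delivery' AND the target contains 'levering'.
both = conn.execute(
    "SELECT count(*) FROM tm WHERE tm MATCH 'source:delivery target:levering'"
).fetchone()[0]
```

Because the FTS module maintains its own inverted index, MATCH queries stay fast even with millions of rows, which is consistent with the search times reported above.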


 
Alex Aruj
Alex Aruj  Identity Verified
United States
Local time: 14:27
Spanish to English
+ ...
TOPIC STARTER
I have been missing out! Oct 20, 2014

Thanks Michael, Rolf and András for offering these valuable tools, which look very promising. I opened up TMLookup, threw a few items in (in my haste I added something I then wished to remove) and performed a few lookups. I can't wait to add more. For good measure, I am taking MultiFultor with me too.

András, Rolf, any interest in opening your code up for hacking on GitHub? I am not super-familiar with Perl, but it could be a nice challenge to work on, or to port the code to other languages and add more functionality. I am a fairly new programmer with some awareness of neat Python libraries, and I am a C++ student currently in my third semester, so I am definitely open to collaboration and to putting what I've learned to work.

As for dtSearch and CafeTran, I will keep them bookmarked.

For those interested in Lucene but not keen to touch it directly, ElasticSearch comes highly recommended.

I found this today, which appears to allow access to EU text data through something called CKAN.
https://open-data.europa.eu/en/developerscorner

https://github.com/ckan/ckanapi
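To make the CKAN reference concrete: CKAN exposes a plain HTTP "action" API, so a search is just a GET request to /api/3/action/package_search. The sketch below only builds the request URL (actually fetching it needs a network connection, so that part is left as a comment), and the base URL and query string are illustrative examples, not verified endpoints of the EU portal.

```python
from urllib.parse import urlencode

def ckan_search_url(base_url, query, rows=10):
    """Build a CKAN package_search request URL (CKAN API v3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"

url = ckan_search_url("https://open-data.europa.eu/data", "translation memory")

# To actually run the query (requires network access):
#   import json, urllib.request
#   result = json.load(urllib.request.urlopen(url))["result"]
```

The response is JSON with a "result" object listing matching datasets, which is what the ckanapi library linked above wraps for you.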

[Edited at 2014-10-20 21:04 GMT]


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 23:27
English to Hungarian
+ ...
Development Oct 21, 2014

Alex Aruj wrote:

András, Rolf, any interest in opening your code up for hacking on GitHub? I am not super-familiar with Perl, but it could be a nice challenge to work on, or to port the code to other languages and add more functionality. I am a fairly new programmer with some awareness of neat Python libraries, and I am a C++ student currently in my third semester, so I am definitely open to collaboration and to putting what I've learned to work.

I don't have the project on GitHub, primarily because I can't be bothered to register and learn how the site works. In any case, all my software projects are open source, including TMLookup. Anyone is free to contribute, modify, rewrite etc. Contact me by email at andras(farkastranslations.com) if you want to get your hands dirty. I do have some plans for additional TMLookup features, and if you have an interest in working on it, it would be best to coordinate.

Porting to other languages is an interesting idea... I'm guessing that if you really wanted to do it, rewriting from scratch might be better. Perl is cross-platform and seems to work OK, so I don't really see a reason to do it unless you want flashier graphics. Migrating to a different database engine or adding support for multiple database engines might be worthwhile. Basically, most of the things worth improving in TMLookup are database-related (improving speed, search features and the database format, adding support for multiple databases in parallel, adding support for dbs in different formats), which I know very little about. I learned as I went along with the project. One other thing that might be worth considering is adding support for querying online resources like xbench does.
Note that I'm not a professional programmer. TMLookup is a hobby project, and it shows in the code quality. It works, but it's not what I would call elegant or professional. Proceed at your own risk.

Alex Aruj wrote:

For those with interest in Lucene, without touching it, ElasticSearch comes highly recommended.

I found this today, which appears to allow access to EU text data through something called CKAN.
https://open-data.europa.eu/en/developerscorner

https://github.com/ckan/ckanapi


Interesting, I didn't know about the API. You can search the DGT-TM, the IATE termbase and Eurovoc through it. They are all available for download, and I prefer offline searches through my CAT tool or TMLookup/xbench to online queries, but I'm sure some people will be very happy about it. There may be other linguistic resources in the open data collection, but I don't know how to find them.

I had a brief look at Elasticsearch. It looks like Lucene is a level of abstraction added over the raw db, and Elasticsearch is another level added on top of Lucene. For my purposes it's probably better to stay closer to the db engine itself. These things seem to be designed to work as web services or applications installed on a single machine, not as small self-contained apps that are distributed to many people with no tech support. They are also designed to handle sets of documents in a single language, not aligned multilingual sets of sentences. Still, if you know either Lucene or Elasticsearch and think they can be integrated easily, let me know. They do have advanced features like fuzzy search, in-word fragment search and typo-correcting search suggestions that are unlikely to ever be added to TMLookup using the current database engine (SQLite). Basically, SQLite wasn't designed for text search like this, so some features are missing or imperfect. I had to jury-rig things like match highlighting; tools that come with these features do them better. I do suspect that using a bells-and-whistles search engine with fuzzy search would sacrifice the remarkable speed that TMLookup offers on very large databases.
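As a rough illustration of what fuzzy matching involves (not how Lucene, Elasticsearch or TMLookup implement it), Python's standard difflib can score string similarity. Note that this is a linear scan over all candidates, so on a database of tens of millions of TUs it would be far too slow; that is exactly why indexed fuzzy search is hard to bolt onto a plain SQLite setup.

```python
import difflib

segments = [
    "The invoice is overdue",
    "Delivery is scheduled for Monday",
    "Payment terms and conditions",
]

def fuzzy_lookup(query, candidates, cutoff=0.6):
    """Return up to 3 candidates whose similarity ratio exceeds the cutoff."""
    return difflib.get_close_matches(query, candidates, n=3, cutoff=cutoff)

# A misspelled query still finds the nearest segment.
hits = fuzzy_lookup("The invoice is overdu", segments)
```

The cutoff parameter trades recall against noise; indexed engines get the same effect without comparing the query to every stored segment.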

[Edited at 2014-10-21 11:05 GMT]


 
Rolf Keller
Rolf Keller
Germany
Local time: 23:27
English to German
Software is a jungle, with or without Open Source Oct 22, 2014

Alex Aruj wrote:
Rolf, any interest in opening your code up for hacking on GitHub?

No, sorry.

port the code to other languages

Why? Just for fun? Other platforms might be a target, but other languages?

Multifultor is platform-dependent, anyway. Any porting would have to be done from scratch.

add more functionality

Please feel free to propose additional features that fit Multifultor's basic concept. Most users never provide any feedback.

I am a fairly new programmer with some awareness of neat Python libraries and I am a C++ student currently in my third semester

Don't try to become a professional coder. Try to become a software architect. Professional coders who are older than 40 years and nevertheless happy & healthy are a rare species.


 
Dan Lucas
Dan Lucas  Identity Verified
United Kingdom
Local time: 22:27
Member (2014)
Japanese to English
Have you considered a "real" search engine? Oct 22, 2014

Alex Aruj wrote:
I haven't made it there yet to index my translation memories, glossaries, Word files, and pdfs

I had been looking into the use of things like TMLookup, but then I realised that I already have a dedicated search engine. Until I find something better, dtSearch searches dozens of gigabytes of text in a second.

Crucially for me, it can index and search CJK characters. It also offers stemming, proximity, fuzzy, concept and other searches. It deals with almost every file type under the sun, including multi-gigabyte Outlook .pst files. It's been around for nearly a quarter of a century, the developer is responsive and the licensing terms are fair.

Incidentally, I own licenses for both X1 and Copernic (similar search engines) but I didn't like the lack of free support or the way they wanted me to pay for minor new features every year or so. dtSearch is stable and changes only occasionally because it already does what it does very well. I've not needed to spend any money on dtSearch for five years, maybe longer. I bought my copy back in the 1990s and as you can tell I have been a satisfied user ever since.

Of course, dtSearch is not a dedicated terminology application but it is such a flexible tool that it works well in this role, as well as being indispensable as a general search engine for your files and email. Unfortunately it's Windows and Linux only.

EDIT: This post should not be interpreted as a disparagement of TMLookup, which looks like a single-minded but impressive tool, or any of the other solutions. I just happen to already own and be familiar with dtSearch. YMMV.

Regards
Dan


[Edited at 2014-10-22 14:40 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:27
Member (2009)
Dutch to English
+ ...
@Dan: Oct 22, 2014

I'm currently testing dtSearch, and have a question.

If I have it index my Dropbox folder, which is currently 150 GB, how can I make it so I don't have to update my entire index every time I add a couple of files here and there? I know I can create several indices, and update only the relevant ones, but I don't want to do that.

The last time I tested, updating my index took around 4-5 hours. That is, it seems to be trawling through my entire Dropbox folder every time I update the index. I'm sure there must be a faster way to do this. Can't dtSearch figure out which files have been changed (using some kind of file journalling or metadata system)? Isn't that what intelligent backup software does: monitor file changes on your system so it doesn't have to copy everything every time?

But I agree, dtSearch is absolutely amazing. I have a licence for the latest version of Copernic (which I keep trying to like, mainly because I like its UI), but I basically never use it because of its terrible handling of any file bigger than approx. 10 MB. Click on a file of, say, 100 MB in the results list and watch what happens.

Michael


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 23:27
English to Hungarian
+ ...
dtsearch Oct 22, 2014

Dan Lucas wrote:
EDIT: This post should not be interpreted as a disparagement of TMLookup, which looks like a single-minded but impressive tool, or any of the other solutions. I just happen to already own and be familiar with dtSearch. YMMV.

Mileage does vary on this issue. I myself wouldn't use dtSearch for lookups on bilingual files (TMs, termbases, aligned texts). I feel that a tool specifically designed for handling bilingual formats does a better job, e.g. through searches along the lines of "source language segment contains X and target language segment contains Y", and a better hit list display (the target language shown for each hit in a separate column). The obvious downside is that such tools can only work with files that have already been processed (aligned and imported).
Conversely, people who are used to desktop search tools are often reluctant to switch to a specialized search tool for TMs and stick with what they know. As an example, a gentleman ordered a large EU TM collection from me and requested it as a set of tables in HTML files, to be indexed by dtSearch. I don't think I even sent him TMX files or any other "bilingual" format at all. To me that's a little crazy, but to each their own. Desktop search tools have two things going for them: they aren't afraid of large datasets and they can handle many formats.

Using dtsearch for raw (unaligned) reference material and a CAT tool or a TM search tool (TMLookup, xbench) seems like a good mixed approach, but few people apart from Michael have the willingness to juggle several tools.

[Edited at 2014-10-22 16:42 GMT]


 
Dan Lucas
Dan Lucas  Identity Verified
United Kingdom
Local time: 22:27
Member (2014)
Japanese to English
You're right, it doesn't re-index on the fly Oct 22, 2014

Michael Beijer wrote:
Can't dtSearch figure out which files have been changed (using some kind of file journalling or metadata system)? Isn't that what intelligent backup software does: monitor file changes on your system so it doesn't have to copy everything every time?

That's what Copernic and X1 do; presumably they tap into whatever routines are used to write a file, so they're notified when something changes. It was this function that attracted me to those two packages in the first place.

Perhaps dtSearch doesn't do this because one of the basic assumptions underlying its operational model is that the data to be indexed will frequently be on remote servers, rather than on the same PC, where trapping file activity isn't possible? I've never asked David; maybe I should.
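The change detection Michael asks about can be approximated without any OS hooks by snapshotting modification times and sizes, which is roughly what a polling indexer falls back to when it can't trap file writes. The following is a standard-library sketch of that idea, not a description of dtSearch's internals:

```python
import tempfile
from pathlib import Path

def snapshot(folder):
    """Map each file path to (mtime, size) so later changes can be diffed."""
    state = {}
    for path in Path(folder).rglob("*"):
        if path.is_file():
            st = path.stat()
            state[str(path)] = (st.st_mtime, st.st_size)
    return state

def changed_files(before, after):
    """Paths that are new or modified between two snapshots."""
    return sorted(p for p, sig in after.items() if before.get(p) != sig)

with tempfile.TemporaryDirectory() as d:
    Path(d, "a.txt").write_text("already indexed")
    before = snapshot(d)                    # state at last index update
    Path(d, "b.txt").write_text("new file")  # only this should be re-indexed
    after = snapshot(d)
    new = changed_files(before, after)       # just b.txt, not the whole folder
    expected = [str(Path(d, "b.txt"))]
```

Even this naive approach only re-reads metadata, not file contents, which is why an incremental index update can finish in minutes rather than hours.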

With dtSearch I typically have the update run as a scheduled background service a couple of times a day. So, yes, the index is usually a few hours out of date. This isn't critical for my purposes. If I want to update the index manually, it takes literally a couple of minutes (see below).

When updating an index, make sure you have the following options set:
"Index new or modified documents" --> Checked
"Clear index before adding documents" --> Unchecked
"Compress index after adding documents" --> Unchecked

In my case the data covered by this particular index is 130 GB in size, consisting of about 2.7 million documents. I just ran the updater manually, and it took under two minutes to check the index and add the 16 new files that needed to be indexed. So you should see similar update times once the initial index has been created.

Regards
Dan


 
Dan Lucas
Dan Lucas  Identity Verified
United Kingdom
Local time: 22:27
Member (2014)
Japanese to English
As-you-type hit list would be nice too Oct 22, 2014

FarkasAndras wrote:
I feel that a tool specifically designed for handling bilingual formats does a better job, e.g. through searches along the lines of "source language segment contains X and target language segment contains Y", and a better hit list display (target language show at each hit in separate column). The obvious downside is that they can only work with files that have already been processed (aligned and imported).

I can't disagree; dtSearch doesn't have any terminology-specific functions to smooth the translator's workflow.

Also, one of the slight dissatisfactions I have with dtSearch is that you don't get a hit-list that updates as you enter your search term, as in some other search engines or in the search functions of some IDEs and text editors.

Dan


 