Removing line break to make text *flow*
Thread poster: Mette Hansen
Mette Hansen
Denmark
Member (2002)
English to Danish
Nov 21, 2003

I just saved a PDF file as RTF in order to build a TM in WinAlign, but the problem is the line breaks. The lines break in the middle of a sentence instead of at the full stop at the end of the sentence.
Does anybody know how to remove line breaks and make the text flow?
Thanks in advance.


 
Ralf Lemster
Germany
English to German
Search and Replace Nov 21, 2003

You can use Search and Replace, searching for a paragraph break and replacing it with nothing (or a space, depending on your text structure).

BTW saving PDFs as .doc files has improved significantly with Acrobat 6.

HTH, Ralf


 
murat Karahan
Türkiye
English to Turkish
just do this Nov 21, 2003

You may have a Word program with an interface in a different language but these commands don't change. Open the search/replace window and type ^p in the search line and leave the replace section empty. This way you can remove all the paragraph breaks.
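
For anyone working on a plain-text export of the PDF rather than inside Word, the same blanket replacement can be sketched in Python. This is only an illustration: the function name `remove_all_breaks` is made up here, and it assumes the text is already loaded as a string. Like the ^p replacement, it removes every break, including the ones that separate real paragraphs.

```python
def remove_all_breaks(text: str, replacement: str = " ") -> str:
    """Replace every line break with `replacement` (a space, or "" for nothing)."""
    # Normalise Windows (\r\n) and bare \r endings to \n first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Blanket replacement: no attempt to keep real paragraph breaks.
    return text.replace("\n", replacement)


sample = "The lines break in the\nmiddle of a sentence.\nNext sentence."
print(remove_all_breaks(sample))
```

Passing `""` as the replacement mirrors leaving the replace box empty, which glues words together wherever the line did not end in a space.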

 
sylver
English to French
A little pointer Nov 22, 2003

There is no way to do that perfectly in one shot, but there are ways to minimize the damage. Try the import function of Wordfast. It tries to guess where the paragraph marks should be left and where they should be removed. (That can be done very easily with the free demo version.)

Check the manual for instructions and settings. If your PDF is long, that little trick could save you hours.

[Edited at 2003-11-22 14:18]


 
sylver
English to French
NB Nov 22, 2003

Line breaks and paragraph marks are not the same. What you have are paragraph marks (^P or ^013 for search purposes), whereas a line break is ^l.
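
To check which of the two a document actually contains, the characters can be counted in Python. This is a rough sketch that assumes the text has been pulled out of Word as a raw string in which a paragraph mark is character 13 (carriage return) and a manual line break is character 11 (vertical tab); the function name `count_breaks` is invented for the example.

```python
PARAGRAPH_MARK = "\r"   # character 13, assumed to be what a paragraph mark is stored as
LINE_BREAK = "\x0b"     # character 11, assumed to be what a manual line break is stored as


def count_breaks(text: str) -> dict:
    """Count paragraph marks and manual line breaks in a raw text string."""
    return {
        "paragraph_marks": text.count(PARAGRAPH_MARK),
        "line_breaks": text.count(LINE_BREAK),
    }


print(count_breaks("first\rsecond\x0bstill second\r"))
```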

 
Ari Nuncio
United States
Spanish to English
Another approach Nov 22, 2003

I agree with Sylver: this is not going to be a one-shot operation, especially if you don't have Wordfast or the demo (which now limits you to files of 110 K or less). But there's a slightly more sophisticated approach that could save you from eliminating all paragraph marks indiscriminately (some of which correspond to real paragraphs) or (on large documents) removing hundreds or even thousands of paragraph marks one by one.

In Word, open the Search and Replace module. Type " ^p" in the find box and "^p" in the replace box. Notice the space before the "^p" in the find box. What you're doing here is replacing every paragraph mark that has a space in front of it with a bare paragraph mark. This may not be necessary for PDF imports, but let's assume you want a technique that will work for any document with unnecessary paragraph marks (I still get them on a regular basis). Hit the "Replace All" button. Now do it again, to make sure you have no extra spaces left at the end of lines.

Before I describe the rest of the process, allow me to point out that there's a way to avoid that extra step (indeed, there's a way to make this whole process fully automated) using Visual Basic for Applications. If you're using Word 2000 or XP, I'd be happy to provide you with a template that you can use for just this purpose.
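
This is not the VBA template offered above, just a rough Python sketch of the trailing-space cleanup for a plain-text export. It assumes lines are separated by "\n", and the name `strip_trailing_spaces` is made up for the example. Because the pattern matches a whole run of spaces at once, a single pass is enough here.

```python
import re


def strip_trailing_spaces(text: str) -> str:
    """Remove runs of spaces that sit directly before a line break."""
    # " +" matches one or more spaces; the lookahead keeps the break itself.
    return re.sub(r" +(?=\n)", "", text)


print(repr(strip_trailing_spaces("one  \ntwo \nthree")))
```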

Next step. In Search and Replace, under Search Options, tick the box that says Wildcards. In the Find box, type "[a-z]^13[a-z]" (without the quotation marks). Leave the Replace box blank, but use the Format button to select Highlight, then hit the Replace All button. What you're doing here is highlighting every paragraph mark that has been placed between two lower-case letters.

Under Search Options, deselect Wildcards. Now type "^p" in the Find box. Hit the Format button and select Highlight. The Find box should now have "^p" after "Find what" and it should say Format: Highlight under the box. Go to the Replace box and type in a blank space. Hit the Replace All button.
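
On a plain-text export, the two passes above (highlight, then replace) collapse into one regular expression. The following is a minimal sketch under that assumption, not a reproduction of what Word does internally; the name `join_lowercase_breaks` is hypothetical.

```python
import re


def join_lowercase_breaks(text: str) -> str:
    """Replace a line break with a space when it sits between two lower-case letters."""
    # Lookbehind and lookahead test the neighbouring letters without consuming them.
    # Note: [a-z] covers unaccented ASCII letters only, so letters such as
    # "æ" or "é" next to a break will not trigger a join.
    return re.sub(r"(?<=[a-z])\n(?=[a-z])", " ", text)


sample = "the lines\nbreak here.\nNew sentence\nTitle"
print(join_lowercase_breaks(sample))
```

Breaks after a full stop or before a capital letter are left alone, which is why the capitalized case needs the separate, riskier pass described next.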

That takes care of all the paragraph marks between words in lower case. But if the document you're trying to clean has lots of letters in caps that are not the beginning of sentences (or, say, you're working on a document in German, which capitalizes all nouns), you'll need to perform another operation, this time involving a certain amount of risk.

The idea is to remove any paragraph mark that does not occur after the end of a sentence (i.e., ending in a period). The problem is that you may also remove paragraph marks after titles, which in many languages do not end in periods. So your decision to proceed with the next step depends on how prevalent capitalization is in the body of your text. If it's the exception and not the rule, then now would be a good time to start a semi-manual search (as described by others on this page) to remove remaining rogue paragraph marks.

Assuming heavy capitalization is the rule and not the exception in your text, select Wildcards again and type "[a-z]^13[A-Z]" in the Find box. As above, leave the Replace box blank (making absolutely sure that there is no blank space " " in it), and use the Format button to select Highlight. Hit the Replace All button.

Now every paragraph mark that falls between a lower-case letter and an upper-case letter has been highlighted. This could include the marks after titles. You'll need to go through your text and place a unique non-letter character (not A-Z) at the end of each title line. Use a character that normally is not used in the language your text is in or one that you're sure does not appear anywhere else in the text. For the sake of this exercise, let's say that the character is "¿" (an upside-down question mark).

Now you're ready to replace all highlighted paragraphs again. So, as above: under Search Options, deselect Wildcards. Type "^p" in the Find box. Hit the Format button and select Highlight. The Find box should now have "^p" after "Find what" and it should say Format: Highlight under the box. Go to the Replace box and type in a blank space. Hit the Replace All button.

The last step is easy. We need to remove the upside-down question mark after titles. I suggested that you use a unique character that you're sure does not appear elsewhere in the text, but what if the text is huge and unpredictable, and you can't be sure of anything? Let's cover our bets by typing the following into the Find box: "¿^p". In the Replace box, type "^p". Replace All.
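
The marker idea can be sketched in Python for a plain-text export. This is an illustration with assumptions stated up front: title lines have already been tagged with "¿" before the joining pass runs, lines are separated by "\n", and the function name is invented. Because the marker sits between the last letter of the title and the break, the break no longer matches the lower-case/upper-case pattern and survives.

```python
import re

MARKER = "¿"  # assumed not to occur anywhere else in the text


def join_lower_upper_breaks(text: str, marker: str = MARKER) -> str:
    """Join breaks between a lower-case and an upper-case letter, sparing marked titles."""
    # A break that follows the marker is not preceded by a-z, so it is kept.
    joined = re.sub(r"(?<=[a-z])\n(?=[A-Z])", " ", text)
    # Remove the marker only where it sits right before a break, in case
    # the same character also appears legitimately elsewhere.
    return joined.replace(marker + "\n", "\n")


sample = "Introduction¿\nDas ist ein\nHaus am See.\nEnde"
print(join_lower_upper_breaks(sample))
```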

Any remaining issues will have to be corrected semi-manually. Admittedly, this process is not without drawbacks. But if the document in question is large, you will save yourself a lot of time.

If you'd like the template I mentioned, contact me at [email protected]. No charge, of course.


 
Mette Hansen
Denmark
Member (2002)
English to Danish
TOPIC STARTER
Thank you so much!!! Nov 23, 2003

Dear Ralf,
I used your method and it worked perfectly. You have just saved me countless hours of work on this project and many future ones.
I thank you with all my heart.
Sincerely,
Mette
Ralf Lemster wrote:

You can use Search and Replace, searching for a paragraph break and replacing it with nothing (or a space, depending on your text structure).

BTW saving PDFs as .doc files has improved significantly with Acrobat 6.

HTH, Ralf


 
Shawn Champion
Sweden
Swedish to English
Brilliant! May 20, 2011

Ari Nuncio wrote:

...

Next step. In Search and Replace, under Search Options, click the box that says Wildcards. In the Find box, type "[a-z]^13[a-z]." Leave the Replace box blank, but use the Format button to select Highlight. What you're doing here is highlighting all paragraph marks that have been placed between two lower-case letters.

...


Some of what you wrote in your post didn't work for me and some was overzealous, but most of it worked. After some fiddling, I removed over 500 rogue paragraph marks from an 8,000-word document. This will save me hours of work. Thanks!


 
FarkasAndras
English to Hungarian
Remove single line breaks May 20, 2011

If the file is no longer than about 10 pages, it's best to just use find and replace manually. Using the keyboard shortcuts in the find and replace window, you can do a page in under 30 seconds.

If the file is too long for this and you need an automated solution, the simplest option is to just remove single line breaks. It's a lot simpler than some of the ideas proposed above, and likely to work just as well.
When you export a PDF, paragraphs and other units are usually separated by two line breaks, while the rogue line breaks are singles. Aligners segment the text anyway, so merging a couple of segments by accident isn't going to cause a huge problem; it's better to be a bit overzealous with merging than to miss split segments by being too cautious. If all goes well, the segmenter will just split any accidentally merged segments again.
If you want to do this in Word, you could replace ^p^p with, say, XXX@@@linebreak@@@XXX, then replace ^p with a space (not with nothing as suggested above!). Then replace XXX@@@linebreak@@@XXX with ^p. You can also replace multiple spaces with single spaces in case the previous step introduced any superfluous spaces.
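
The same three replacements can be sketched in Python for a txt export. This is a minimal illustration, assuming lines are separated by "\n" and that the placeholder string does not occur in the text; the function name `remove_single_breaks` is made up, and this is not how LF Aligner itself is implemented.

```python
import re

PLACEHOLDER = "XXX@@@linebreak@@@XXX"  # assumed absent from the text


def remove_single_breaks(text: str) -> str:
    """Keep double line breaks as paragraph boundaries, turn single ones into spaces."""
    text = text.replace("\n\n", PLACEHOLDER)   # protect real paragraph boundaries
    text = text.replace("\n", " ")             # rogue single breaks become spaces
    text = text.replace(PLACEHOLDER, "\n")     # restore one break per boundary
    return re.sub(r" {2,}", " ", text)         # collapse any doubled spaces


sample = "First line\nof paragraph one.\n\nSecond paragraph\nhere."
print(remove_single_breaks(sample))
```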

BTW LF Aligner does all this automatically if you feed it a pdf file (or the txt export of a pdf file).

[Edited at 2011-05-20 12:10 GMT]