How to extract text from Acrobat Reader Thread poster: Céline Graciet
Hi everyone, I'm trying to extract text from a PDF file, but to no avail. It won't let me highlight or select any of it. I even tried scanning it and sending the picture to Word, but the result was mumbo jumbo. I hope the more technically minded among you will come to my rescue!

Natalie Poland Local time: 23:56 Member (2002) English to Russian + ... MODERATOR SITE LOCALIZER
I am pretty sure that your PDF file is in fact a graphical PDF, so the best way would be to open it in an OCR application capable of reading PDFs (for example, FineReader 6) and convert it to text. You may contact me privately if you need any technical help.
Best, Natalia

Endre Both Germany Local time: 23:56 English to German No way to get around scanning | Jan 15, 2003 |
...or, more precisely, reading the PDF into a character recognition (OCR) software, if your PDF is an all graphics file (indicated by the impossibility of highlighting text).
The results of course depend on your OCR software and the settings you apply before recognition.
In any case, the procedure is likely to involve a lot of work (I've just spent a few hours on a similar task) and only pays off if the text contains lots of repetitions and you can use a CAT tool afterwards. Otherwise, just use a printout and type the translation into Word.
A basic rule of mine, BTW: no discounts for repetitions in PDF texts.
Feel free to get in touch with me directly if you think I can help you.
Endre EB Communications

TService (X) Local time: 23:56 English to German Three possible reasons. | Jan 16, 2003 |
1) There are some kinds of protected PDFs around, allowing you to view the contents only and preventing any attempt to copy. Solution: Request an unprotected version.
2) Some PDFs cannot be opened correctly with the free version of Acrobat Reader. Solution: Get Acrobat 5 - but it's quite costly.
3) Some PDFs just show "garbage" when copied and pasted into another application. Solution: Contact me; I wrote a tiny algorithm to decode that "garbage" using MS Access.
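The post does not say how the algorithm in point 3 works, so what follows is only a sketch of one possible approach, not the poster's method. It assumes the "garbage" comes from a font with a non-standard encoding, so that every character is consistently replaced by another one. Under that assumption, a substitution table can be learned from a short sample whose correct reading is known and then applied to the rest of the text. The names `build_table` and `decode` and the sample strings are hypothetical, and the sketch is in Python rather than MS Access.

```python
from typing import Dict


def build_table(garbled_sample: str, known_text: str) -> Dict[int, str]:
    """Learn a character-for-character mapping from an aligned sample.

    Assumption: the garbling is a fixed one-to-one substitution.
    """
    if len(garbled_sample) != len(known_text):
        raise ValueError("samples must have the same length")
    table: Dict[int, str] = {}
    for bad, good in zip(garbled_sample, known_text):
        previous = table.setdefault(ord(bad), good)
        if previous != good:
            # The same garbled character would need two different readings,
            # so the substitution assumption does not hold for this text.
            raise ValueError(f"{bad!r} maps to both {previous!r} and {good!r}")
    return table


def decode(garbled: str, table: Dict[int, str]) -> str:
    """Apply the mapping; characters missing from the table stay unchanged."""
    return garbled.translate(table)


if __name__ == "__main__":
    # Hypothetical sample: the pasted text reads "Ifmmp xpsme" where the
    # PDF visibly says "Hello world".
    table = build_table("Ifmmp xpsme", "Hello world")
    print(decode("mpxfs", table))  # prints "lower"
```

If the same garbled character turns out to stand for two different letters, `build_table` raises an error, which suggests the problem is something other than a simple substitution and OCR may be the better route.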
monitor Local time: 23:56 English to German + ... more than one solution | Jan 16, 2003 |
Hi Céline,
First, you should find out whether your PDF is copy-protected. If it is, save the file under a new file name, which in most cases removes the protection. To do so you need Adobe Acrobat, not just the Reader.
- In Adobe Acrobat you can save the text directly by exporting it to an RTF file.
- You should also consider Gemini solo, a file/image extraction tool from inceni.com, which can be downloaded for free as a trial version (restricted usage), but it works.
Hope this is all fine for you. Kind regards, Marcel
The protection cannot be removed using Acrobat Reader!
Following some of your advice, I downloaded a freeware OCR program. OK, it didn't work (it wouldn't save my document as a Word doc), but it was good to try! It's called WebOCR and seems really good, if you can make it work...

dkalinic Local time: 23:56 Croatian to German + ... In memoriam Abbyy FineReader works fine with PDF files | Jan 16, 2003 |
You might try using Abbyy FineReader. It reads PDF files and exports them as Word documents. The graphics are kept too.
Greetings, Davor

monitor Local time: 23:56 English to German + ... Abbyy is it!!! | Jan 17, 2003 |
After the last comment I went to the bookstore, bought FineReader and installed it on my notebook. I took a 24-page corporate brochure in PDF, imported it and extracted it into Word 2000. Wow!!! Never seen that before. Buy version 6.0 with that new feature and you are safe, once and for all. Marcel
Simona Oliva (X) France Local time: 23:56 French to Italian + ... click on a button | Feb 10, 2003 |
Hi Celine,
This reply might come too late, but I just found a button in Acrobat Reader called "select a text" (there is a T and a small square on the right-hand side). If you click on it, you will be able to highlight the text you need, then right-click and copy and paste it into a Word doc. Hope it helps. Simona
pdf2txt will change the text from pdf to a plain text file. This can be helpful but you remove all formatting when doing this. It is fairly inexpensive at $38.00 for a license. There is a free trial as well. For more info see: http://www.verypdf.com/pdf2txt/pdf2txt.htm
You can also use pstotext. It is a bit more difficult to use, so if you aren't very tech-savvy it probably isn't for you. You need to install GhostScript and GhostView on your system (both free), then pstotext, and then run the extract function. It doesn't handle every type of PDF, but it will handle many of them. You can find out more about it at: http://www.research.compaq.com/SRC/virtualpaper/pstotext.html
A list of other tools can be found at: http://www.pdfzone.com/toolbox/toolfilter.html This page tells you all you wanted to know about PDFs but would rather never have to learn.
All tools for getting a word count from a PDF, including Adobe Acrobat, share one weakness: a PDF can be nothing more than a scanned page without any OCR. Such a PDF is just a picture, so there is no way to extract a word count from this type of file without running an OCR program yourself.
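As a rough sketch of the point above: once a tool such as pdf2txt has produced plain text, counting words is simple, and a count of zero is a strong hint that the PDF is only a scanned picture and needs OCR first. The function names and the whitespace-based definition of a "word" are assumptions for illustration; CAT tools and Word count by their own rules, so the numbers may differ.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in already-extracted text."""
    return len(text.split())


def needs_ocr(extracted_text: str) -> bool:
    """Heuristic: no words at all suggests an image-only PDF."""
    return count_words(extracted_text) == 0


if __name__ == "__main__":
    sample = "Hello, this is extracted text."
    print(count_words(sample))   # prints 5
    print(needs_ocr(" \n\f"))    # prints True: only whitespace came out
```

A non-zero count does not prove the extraction is complete, since a PDF can mix real text with scanned pages, so it is still worth comparing the result with what is visible on screen.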