Nutch issue: PDFs are fetched properly but not parsed during a crawl
I am using Nutch 2.3.1 with HBase-0.98.8-hadoop2. The crawl runs fine for HTML pages, but when I run it for PDF URLs, only some of them seem to be parsed and most of them never reach Solr. I tried parsechecker on the same URLs and it works fine, and the fetch step also works fine. During the crawl itself, however, the PDFs are not parsed. What can I check in this situation?
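One thing that might be worth ruling out is content truncation. This is only a guess and not something the question confirms. Nutch has an `http.content.limit` property that truncates fetched content, and a `parser.skip.truncated` property that makes the parser skip truncated documents, so PDFs larger than the limit could be fetched but never parsed. The sketch below reads a `nutch-site.xml` string and reports whether a PDF of a given size would fall into that case. The defaults in `ASSUMED_DEFAULTS` (65536 and true) are assumptions about `nutch-default.xml` and should be checked against your own copy. The helper names and the example config are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed defaults from nutch-default.xml; verify them in your installation.
ASSUMED_DEFAULTS = {
    "http.content.limit": "65536",
    "parser.skip.truncated": "true",
}


def read_properties(site_xml: str) -> dict:
    """Return a name -> value mapping for every <property> in a Hadoop-style config."""
    root = ET.fromstring(site_xml)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def may_be_truncated(site_xml: str, pdf_size_bytes: int) -> bool:
    """True if a PDF of this size would be truncated and then skipped by the parser."""
    # Site settings override the assumed defaults.
    props = {**ASSUMED_DEFAULTS, **read_properties(site_xml)}
    limit = int(props["http.content.limit"])
    skip_truncated = props["parser.skip.truncated"].lower() == "true"
    # A negative limit means content is not truncated at all.
    return skip_truncated and limit >= 0 and pdf_size_bytes > limit


# Hypothetical nutch-site.xml override that disables truncation.
EXAMPLE_SITE_XML = """<configuration>
  <property><name>http.content.limit</name><value>-1</value></property>
</configuration>"""

if __name__ == "__main__":
    print(may_be_truncated("<configuration/>", 200_000))
    print(may_be_truncated(EXAMPLE_SITE_XML, 200_000))
```

If the PDFs that fail are consistently the larger ones, comparing their sizes against the configured limit this way could help confirm or rule out truncation before looking elsewhere.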
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow