Qt / QTBUG-106795

QPdfDocument::getAllText missing characters

Details

    • Bug
    • Resolution: Unresolved
    • P3: Somewhat important
    • None
    • 6.3.2
    • PDF
    • None
    • Windows

    Description

      QPdfDocument::getAllText does not return all of the characters on a page; some are missing. Please see the attached out.pdf. The attached out.txt was generated with the PyMuPDF (fitz) command line:

      python3 -m fitz gettext -pages 1 out.pdf

      That output is correct. The result from QPdfDocument::getAllText, however, is missing some characters; I saved it in the attached getAllText.txt. Here is a comparison of one line:

      getAllText: 是一个共享 ,供 个 系 统 (如在计算 机之

      PyMuPDF: 接口是一个共享框架,供 个 系 统 (如在计算机和打印机之间

      As this shows, many characters are missing. I suspect PDFium returned the wrong result, but Chrome handles this PDF correctly (copying text works fine, as it does in other PDF viewers). Could this be related to the Chromium version that Qt uses?
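
      The comparison above can be made mechanical. The following is a minimal sketch, not part of the original report: it uses Python's standard difflib module to list the runs of characters that appear in the PyMuPDF output but not in the getAllText output. The helper name missing_runs is hypothetical, and the two sample strings are copied from the lines quoted above. Whitespace is ignored because the two extractors space the text differently.

```python
import difflib


def missing_runs(reference: str, candidate: str) -> list[str]:
    """Return runs of characters found in `reference` but absent from `candidate`.

    Hypothetical helper for illustration; whitespace is stripped from both
    inputs before comparing.
    """
    ref = "".join(reference.split())
    cand = "".join(candidate.split())
    # autojunk=False keeps difflib from treating frequent characters as noise.
    matcher = difflib.SequenceMatcher(None, cand, ref, autojunk=False)
    runs = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark text that exists only in the reference.
        if tag in ("insert", "replace"):
            runs.append(ref[j1:j2])
    return runs


# Sample lines quoted in this report.
pymupdf_line = "接口是一个共享框架,供 个 系 统 (如在计算机和打印机之间"
get_all_text_line = "是一个共享 ,供 个 系 统 (如在计算 机之"

for run in missing_runs(pymupdf_line, get_all_text_line):
    print(run)
```

      To check a whole page, the same function could be applied to the contents of out.txt and getAllText.txt, assuming both files are read as UTF-8.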

      Attachments

        1. 2022-09-20 22_08_33-test2.pdf - SumatraPDF.png
          223 kB
        2. getAllText.txt
          3 kB
        3. out.pdf
          98 kB
        4. out.txt
          4 kB

          People

            srutledg Shawn Rutledge
            zhaohongjian000 zhao hongjian