KAPOW

Welcome to the Kapow forum. Here you can get help, use your skills to help others and enjoy hanging out in the company of other Kapow Robot Developers.


Creating robots to crawl PDF

E^E_2016

Posts : 10
Points : 351
Join date : 2016-08-29

Creating robots to crawl PDF

Post by E^E_2016 on Fri Sep 02, 2016 9:50 am

Hi,

I am new to Kapow and was hoping to get some direction. I could not find a tutorial on this, so I am hoping that someone here who knows can help.

I would like to crawl only certain content inside a PDF file, not everything. For example, the PDF content may already be in table format and contain values as well as text. How can I create robots that structure the content properly and retrieve only the values, omitting the text? When I extract the contents of a PDF, the values and text come out all mixed up.

I understand that the contents of a PDF cannot be exported in Excel format, right? Is there a workaround for this?

I hope you can guide me or share your experience in resolving this.

Thanks.
Shyam Kumar

Posts : 73
Points : 1622
Join date : 2013-07-05
Location : Kerala, India

Re: Creating robots to crawl PDF

Post by Shyam Kumar on Fri Sep 02, 2016 12:18 pm

Hi,

Thanks for your post. We are always ready to help you.

Data extraction from PDF is somewhat difficult because there are so many different types of PDF files.
We can export content from PDF to Excel format.

Can you give me some sample PDF files and tell me which content you need to extract and export?


Thank you.


Last edited by Shyam Kumar on Thu Jan 19, 2017 10:12 am; edited 1 time in total

E^E_2016

Posts : 10
Points : 351
Join date : 2016-08-29

Re: Creating robots to crawl PDF

Post by E^E_2016 on Fri Sep 02, 2016 1:05 pm

Hi,

Thanks for the reply. Do you mean that you can export the content of a PDF to Excel in Kapow? I tried using the "extract to excel" function, and it says that the function does not work with the PDF format. Could you advise me which steps I need in Kapow to create robots that export PDF content to Excel?
Shyam Kumar

Posts : 73
Points : 1622
Join date : 2013-07-05
Location : Kerala, India

Re: Creating robots to crawl PDF

Post by Shyam Kumar on Fri Sep 02, 2016 2:14 pm

Hi,

Yes, we can export content of PDF to Excel in Kapow.

I am not sure what you mean in your reply by 'function extract to excel not working'. Can you please attach some screenshots? They may help me give a better solution. Also, which version are you using?

Kapow offers many options for exporting content from PDF to Excel, depending on the type of PDF.

Are you using a database?


If you are using a database, you can convert the PDF file, extract the data content into a variable, and store the data. Then you can export from the database.

Otherwise, you can use the 'Write File' action step available in Kapow and write the content directly to Excel.

If you give me some sample PDF files, I will work on this and give you a proper solution.


Thank you


Last edited by Shyam Kumar on Thu Jan 19, 2017 10:12 am; edited 1 time in total

E^E_2016

Posts : 10
Points : 351
Join date : 2016-08-29

Re: Creating robots to crawl PDF

Post by E^E_2016 on Fri Oct 21, 2016 9:33 am

Hi,

Thank you for getting back to me so fast. I really appreciate it. I think I get what you are trying to say, but it would be helpful if you could provide some screenshots of how to do this. Let me send over a sample PDF here. Basically, the requirement is to extract the transaction details in the bill statement, which appear in table format. Can you guide me through how to create a robot to do this? Thank you.

Please download the PDF file for free at this link, since I can't attach it here. It is too big.
https://ufile.io/ed91

Hope to hear from you soon. Thanks.
Shyam Kumar

Posts : 73
Points : 1622
Join date : 2013-07-05
Location : Kerala, India

Re: Creating robots to crawl PDF

Post by Shyam Kumar on Mon Oct 24, 2016 5:02 pm

Hi,


As I understand it, you need to extract only the TRANSACTION DETAILS from the PDF file.



If you need to process multiple PDF files, use the file system action steps and select the “For Each File” action step, or you can load the file directly by URL.



After loading the PDF file, convert it using “Extract Binary Content” and “Extract from PDF”.


Here, extract the full binary content into the pdf variable, then run the extraction on that same variable so that the data can be displayed.


In the PDF mentioned above, you need to extract the transaction details. When I examined the PDF file, I found that the content of each transaction is located in a paragraph tag (<p>).


All the transaction details are contained in <p> tags, and each of those tags starts with the date of the transaction. So as the initial step you should extract the date. Because we are looping over all the paragraph tags, if a paragraph tag fails the date extraction we need to skip it and take the next one, because it is not a transaction detail.
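
The date check described above can be sketched outside Kapow in plain Python. This is only an illustration of the pattern idea, not Kapow's own expression syntax. The date formats ("02/09/2016" or "15 Sep") and the function name are assumptions, since the actual format in the sample statement is not shown in this thread.

```python
import re

# Assumption: a transaction paragraph starts with a date such as
# "02/09/2016", "02-09-16", "02/09" or "15 Sep". Adjust to the real statement.
DATE_AT_START = re.compile(
    r"^\s*(\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?|\d{1,2}\s+[A-Za-z]{3})\b"
)

def transaction_paragraphs(paragraphs):
    """Return (date, rest) pairs for paragraphs that begin with a date."""
    rows = []
    for text in paragraphs:
        match = DATE_AT_START.match(text)
        if match is None:
            continue  # no date at the start: not a transaction, skip it
        date = match.group(1)
        rest = text[match.end():].strip()
        rows.append((date, rest))
    return rows
```

Paragraphs such as a statement heading or a "Total due" line do not start with a date, so they are skipped. Only the paragraphs that pass the date test go on to the next extraction step.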



Then extract the remaining content you need and store it in variables (here I am using Kapow's default ScratchPad variables).

If you are using a database, you can insert the data into a database table using the “Store in Database” action step.


If you want to write the content of the PDF out directly, you can simply write it to a CSV file using the “Write File” action step.


In the Write File action step, you should give the file name (location).

File Name: /root/Desktop/Excel.csv // Here you can give your location

variable1+"\t"+variable2+"\t"+variable3+"\t"+variable4+"\t"+variable5+"\n"

\t (tab: moves to the next column)

\n (newline: moves to the next line)

Then run the robot to see the extracted data.
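
For comparison, here is a minimal Python sketch of the same output format: one tab-separated line per transaction, ending in a newline. It is not part of the Kapow robot. The function name and the sample rows in the usage note are hypothetical.

```python
import csv

def write_rows(path, rows):
    """Write each row as tab-separated values, one row per line."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

Calling `write_rows("Excel.csv", [("02/09/2016", "GROCERY", "45.00")])` produces the same kind of line as the expression above: the values joined by tabs and ended by a newline.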




If you don't understand anything, please let me know.




Thank you.


Regards,

Shyam kumar P

E^E_2016

Posts : 10
Points : 351
Join date : 2016-08-29

Re: Creating robots to crawl PDF

Post by E^E_2016 on Wed Oct 26, 2016 5:20 pm

Hi Shyam,

Thanks for going through the earlier exercise with me here. Your help has been very valuable, and I appreciate the time you have put into this. The only area I do not understand is the patterns and expression configuration. Hopefully, when you have time, we can discuss this part in more detail. Thanks.

kaundalsajan10@gmail.com

Posts : 1
Points : 178
Join date : 2017-02-01

PDF

Post by kaundalsajan10@gmail.com on Wed Feb 01, 2017 6:35 pm

You can also configure the robot using the "Merge Text" option available in the Extract from PDF step. The data is displayed in two different formats depending on whether this option is enabled or disabled.

jinitha kumari.j.r

Posts : 1
Points : 1480
Join date : 2013-07-08

Re: Creating robots to crawl PDF

Post by jinitha kumari.j.r on Fri Feb 03, 2017 12:26 pm

Hi Kaundalsajan

        By default, the HTML generated from the PDF merges text that is on the same line into one HTML element, even though the pieces are represented as separate text in the PDF document.

        It is better to turn off this feature (Merge Text) if the PDF document contains more than one column. This helps maintain the column structure.
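
        The effect can be illustrated with a small Python sketch. This is plain Python, not Kapow output. The text fragments, their x positions and the function names are all made up for the example.

```python
# Hypothetical text fragments from one line of a PDF table,
# each paired with its horizontal (x) position on the page.
fragments = [(40, "02/09/2016"), (150, "GROCERY STORE"), (420, "45.00")]

def merged(fragments):
    """Merging on: one element per line, so column boundaries are lost."""
    return " ".join(text for _, text in sorted(fragments))

def unmerged(fragments):
    """Merging off: one element per fragment, so each maps to a column."""
    return [text for _, text in sorted(fragments)]
```

        With merging, the line becomes a single string that has to be split apart again with patterns. Without merging, each fragment stays a separate value that can go straight into its own column.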

     Regards
     Jinitha
