Problems getting text from web pages
Hi,
I have an Excel file with 330+ links to web pages. I need to extract the text from all the web pages to do a clustering task.
I'm not able to achieve this with the usual process:
Get Pages >> Data to Documents >> Loop Collection >> Extract Content >> Documents to Data.
The problem is that, for every page, the operators only retrieve what "view page source" shows in the browser, so I end up with an empty Text attribute.
I tested a single link (https://dre.pt/dre/detalhe/despacho/3219-2020-130112149) with the Get Page operator. This is what I get in the extracted document:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-Content-Security-Policy" content="base-uri 'self'; child-src * gap:; frame-src * gap:; connect-src *; default-src 'self' 'unsafe-inline' *.turtlecreekpls.com *.hotjar.com *.turtlecreekpls.com *.dre.pt *.hotjar.io *.doubleclick.net *.knightlab.com *.google.com *.google.pt gap: 'unsafe-inline' 'unsafe-eval'; font-src 'self' data:; img-src * blob:; script-src 'unsafe-inline' * 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; frame-ancestors *.incm.pt *.dre.pt 'self' gap:; report-uri /SecurityUtils/rest/Report/ReportViolations?Params=RyAoRWX6RljInm%2B3hwnwrmeQNc96mBOvSkaT58%2FC4zhhZv0xIQAa3h3ft2scL67pequ212Wx6csuqpGp8%2B%2F%2B%2Bw%3D%3D; " />
(function () {
    function appendMetaTagAttributes(metaTag, attribute, values) {
        var elem = document.querySelector("meta[name=" + metaTag + "]");
        if (elem) {
            var attrContent = elem.getAttribute(attribute);
            elem.setAttribute(attribute, (attrContent ? attrContent + "," : "") + values.join(","));
        }
    }
    if (navigator && /OutSystemsApp/i.test(navigator.userAgent)) {
        // If this app is running on the native shell, we want to disable the zoom
        appendMetaTagAttributes("viewport", "content", ["user-scalable=no", "minimum-scale=1.0"]);
    }
})();
How can I get the text from this? Can you help me, please?
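To illustrate why the Text attribute comes back empty, here is a minimal Python sketch (not a RapidMiner process) that extracts the visible text from a page source using only the standard library. The `shell` string is a shortened, hypothetical stand-in for the source shown above: it contains metadata and a script but no readable text, so nothing is left once the script is skipped.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical stand-in for the page source: metadata plus a script, no body text.
shell = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script>(function () { var elem = document.querySelector("meta[name=viewport]"); })();</script>
</head><body><div id="app"></div></body></html>"""

print(repr(visible_text(shell)))
```

If the text only appears after the browser runs the page's scripts, any tool that reads the raw source will see the same empty shell.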
Answers
The web page you are trying to crawl is generated by JavaScript, and it loads its content from several endpoints that provide the data you are trying to grab.
I created a template that you can use and adjust to get the text you are trying to obtain.
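The template itself is not reproduced in this thread, so the following is only a rough Python sketch of the same idea: take the identifier from a page link and prepare a request to a data endpoint that carries a token in a header. The endpoint URL, the header name `X-Token`, and the assumption that the trailing digits of the link identify the document are all hypothetical; the real values would have to be read from the browser's network tab.

```python
import re
import urllib.request

# Hypothetical values: the real endpoint path and header name are not given
# in this thread and must be copied from the browser's network tab.
ENDPOINT_TEMPLATE = "https://example.invalid/api/document?id={doc_id}"
TOKEN_HEADER = "X-Token"


def document_id(page_url):
    """Return the trailing run of digits in a link (assumed to be the document id)."""
    match = re.search(r"(\d+)/?$", page_url)
    if match is None:
        raise ValueError(f"no numeric id at the end of {page_url!r}")
    return match.group(1)


def build_request(page_url, token):
    """Prepare, but do not send, a request for the data behind one page."""
    url = ENDPOINT_TEMPLATE.format(doc_id=document_id(page_url))
    headers = {TOKEN_HEADER: token, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


req = build_request(
    "https://dre.pt/dre/detalhe/despacho/3219-2020-130112149",
    "PASTE-TOKEN-HERE",  # placeholder, not a real token
)
print(req.full_url)
```

Nothing is sent over the network here; passing `req` to `urllib.request.urlopen` would perform the call once the placeholders are replaced with real values.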
I ran the process, and it worked for one link (even though I cannot find where the link is defined!).
I now need to run the process for the other links. However, I was not able to follow your instruction:
"Adjust Line 1001 with a macro to loop through the documents you might need. You may also need to adjust the request header token if it expires."
because I don't know how to use macros, and I do not understand how to request the header token.
Can you help me, please?
Thanks.
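For readers with the same question, the looping idea can be sketched in Python: the loop variable `doc_id` plays the role a macro would play in the process, and the token is requested again when it is rejected. Everything here is an assumption for illustration: `TokenExpired`, `fetch_all`, and the two `fake_` helpers are hypothetical names, and the helpers replace real network calls so the sketch runs offline.

```python
class TokenExpired(Exception):
    """Raised by fetch() when the server rejects the token (hypothetical signal)."""


def fetch_all(doc_ids, fetch, get_token):
    """Call fetch(doc_id, token) for every id, renewing the token once if it is rejected."""
    token = get_token()
    texts = {}
    for doc_id in doc_ids:               # doc_id changes on every iteration, like a macro
        try:
            texts[doc_id] = fetch(doc_id, token)
        except TokenExpired:
            token = get_token()          # assume the old token expired; request a new one
            texts[doc_id] = fetch(doc_id, token)
    return texts


# Offline stand-ins so the sketch runs without a network connection.
issued = []


def fake_get_token():
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1]


def fake_fetch(doc_id, token):
    # Pretend the first token stops working when the second document is requested.
    if doc_id == "222" and token == "token-1":
        raise TokenExpired(doc_id)
    return f"text of {doc_id} via {token}"


result = fetch_all(["111", "222", "333"], fake_fetch, fake_get_token)
for doc_id, text in result.items():
    print(doc_id, text)
```

In a real run, the list of ids would come from the Excel file, and `fetch` and `get_token` would wrap the actual endpoint calls.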