```scala
// Extract all the "h3" elements (as a lazy iterable)
doc >> "h3"
// res8: ElementQuery = LazyElementQuery(
//   JsoupElement(Section 1 h3),
//   JsoupElement(Section 2 h3),
//   JsoupElement(Section 3 h3)
// )

// Extract all text inside this document
doc >> allText
// res9: String = "Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014"

// Extract the elements with class ".active"
doc >> elementList(".active")
// res10: List = List(
//   JsoupElement(Section 2)
// )

// Extract the text inside each "p" element
doc >> texts("p")
// res11: Iterable = List(
//   "Some text for testing",
//   "More text for testing"
// )
```

Content Validation

While scraping web pages, it is a common use case to validate whether a page actually has the expected structure.
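The idea of structure validation can be sketched in plain Scala, independent of the library. Everything below (`ValidationSketch`, the `Page` type, the `validator` helper) is hypothetical and only illustrates the concept of combining an extractor with a predicate; it is not the library's actual validator API.

```scala
// Hypothetical sketch: a validator pairs an extractor with a predicate and
// reports whether the page has the expected structure.
object ValidationSketch {
  // A page is modelled here as a plain map from selectors to text.
  type Page = Map[String, String]

  // Build a validator from an extractor and a predicate. It returns the page
  // unchanged on success and a description of the failure otherwise.
  def validator[A](extract: Page => Option[A])(check: A => Boolean): Page => Either[String, Page] =
    page => extract(page) match {
      case Some(a) if check(a) => Right(page)
      case Some(_)             => Left("unexpected content")
      case None                => Left("element not found")
    }

  val page: Page = Map("title" -> "Test page", "#header" -> "Test page h1")
  val hasExpectedTitle = validator[String](_.get("title"))(_ == "Test page")
}
```

Because validators compose as ordinary functions, several such checks can be chained to assert the overall shape of a page before extracting from it.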
The library represents HTML documents and their elements by Document and Element objects, simple interfaces containing methods for retrieving information and navigating through the DOM. Browser implementations are the entry points for obtaining Document instances. Most notably, they implement get, post, parseFile and parseString methods for retrieving documents from different sources. Depending on the browser used, Document and Element instances may have different semantics, mainly regarding their immutability guarantees.

The library currently provides two built-in implementations of Browser:

- JsoupBrowser is backed by jsoup, a Java HTML parser library. It provides powerful and efficient document querying, but it doesn't run JavaScript in the pages. As such, it is limited to working strictly with the HTML sent in the page source;
- HtmlUnitBrowser is based on HtmlUnit, a GUI-less browser for Java programs. It thoroughly simulates a web browser, executing JavaScript code in the pages in addition to parsing HTML.
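The core model described above can be sketched as a small set of traits. This is a simplified, hypothetical rendition for illustration; the `ToyBrowser` object and its regex-based title parsing are not part of the library, which delegates real parsing to jsoup or HtmlUnit.

```scala
// Simplified, hypothetical sketch of the core model: a Browser produces
// Documents, and Documents expose querying methods over Elements.
trait Element {
  def tagName: String
  def text: String
}

trait Document {
  def title: String
  def select(cssQuery: String): Iterable[Element]
}

trait Browser {
  // Entry points for obtaining Document instances.
  def get(url: String): Document
  def parseString(html: String): Document
}

// A toy Browser that only knows how to read a <title> out of a string.
object ToyBrowser extends Browser {
  def get(url: String): Document =
    sys.error("network access is out of scope for this sketch")

  def parseString(html: String): Document = new Document {
    // Extremely naive title extraction, for illustration only.
    def title: String =
      "(?s)<title>(.*?)</title>".r.findFirstMatchIn(html).map(_.group(1)).getOrElse("")
    def select(cssQuery: String): Iterable[Element] = Nil
  }
}
```

Keeping Browser behind an interface like this is what allows the two real implementations to differ in semantics (for example, mutability of the DOM) while sharing the same querying DSL.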
```scala
// Extract the text inside the element with id "header"
doc >> text("#header")
// res0: String = "Test page h1"

// Extract the elements inside #menu
val items = doc >> elementList("#menu span")
// items: List = List(
//   JsoupElement(Home),
//   JsoupElement(Section 1),
//   JsoupElement(Section 2),
//   JsoupElement(Section 3)
// )

// From each item, extract all the text inside their elements
items.map(_ >> allText("a"))
// res1: List = List("Home", "Section 1", "", "Section 3")

// From the meta element with "viewport" as its attribute name, extract the
// text in the content attribute
doc >> attr("content")("meta")
// res2: String = "width=device-width, initial-scale=1"
```

If the element may or may not be in the page, the >?> operator tries to extract the content and returns it wrapped in an Option:

```scala
// Go to a news website and extract the hyperlink inside the h1 element if it
// exists. Follow that link and print both the article title and its short
// description (inside ".lead")
for {
  // "newsUrl" is a placeholder for the address of the site being scraped
  headline <- browser.get(newsUrl) >?> element("h1 a")
  headlineDesc = browser.get(headline.attr("href")) >> text(".lead")
} println("= " + headline.text + " = \n " + headlineDesc)
```

In the next two sections, the core classes used by this library are presented. They are followed by a description of the full capabilities of the DSL, including the ability to parse content after extracting, to validate the contents of a page and to define custom extractors or validators.
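The contrast between plain extraction and Option-wrapped extraction can be illustrated with a stdlib-only analogue. The names below (`extract`, `extractOpt`, the `page` map) are hypothetical and only mirror the behaviour of `>>` and `>?>`; they are not the library's API.

```scala
// Plain-Scala analogue of mandatory (`>>`) versus optional (`>?>`) extraction.
object OptionalExtraction {
  // A page modelled as a map from selectors to text, for illustration only.
  val page: Map[String, String] = Map(
    "#header" -> "Test page h1",
    "meta"    -> "width=device-width, initial-scale=1"
  )

  // Analogue of `>>`: assumes the element is present, failing loudly otherwise.
  def extract(selector: String): String =
    page.getOrElse(selector, throw new NoSuchElementException(selector))

  // Analogue of `>?>`: returns the content wrapped in an Option,
  // so a missing element becomes None instead of an exception.
  def extractOpt(selector: String): Option[String] = page.get(selector)
}
```

The Option result composes naturally with for-comprehensions, which is exactly how the headline example above chains an optional extraction with a follow-up request.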