How to get the HTML of a fully loaded page (with JavaScript) as input in Java?


I need to parse a page. Everything is OK except that some elements on the page are loaded dynamically. I used jsoup for the static elements; then, when I realized that I really need the dynamic elements, I tried JavaFX. I read a lot of answers on Stack Overflow, and many of them recommended the JavaFX WebEngine. So I ended up with this code.

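// Needs the JavaFX classes (Stage, Scene, WebView, WebEngine, Worker, ChangeListener, ObservableValue),
// org.w3c.dom.Document, java.io.StringWriter, java.io.IOException, and the Xerces
// OutputFormat/XMLSerializer classes (e.g. org.apache.xml.serialize).
public void start(Stage primaryStage) {
    WebView webview = new WebView();
    final WebEngine webengine = webview.getEngine();
    webengine.getLoadWorker().stateProperty().addListener(
            new ChangeListener<State>() {
                public void changed(ObservableValue ov, State oldState, State newState) {
                    if (newState == Worker.State.SUCCEEDED) {
                        Document doc = webengine.getDocument();
                        //Serialize DOM
                        OutputFormat format    = new OutputFormat (doc);
                        // as a String
                        StringWriter stringOut = new StringWriter ();
                        XMLSerializer serial   = new XMLSerializer (stringOut, format);
                        try {
                            serial.serialize(doc);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        // Display the XML
                        System.out.println(stringOut.toString());
                    }
                }
            });
    webengine.load(pageUrl); // pageUrl is a placeholder for the address of the page to parse
    primaryStage.setScene(new Scene(webview, 800, 800));
    primaryStage.show();
}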

I built a string from the org.w3c.dom.Document and printed it, but that was useless too. The WebView showed me the fully loaded page (with the element I need rendered on the page), but the element I need was missing from the HTML in the output.

This is the third day I have been working on this issue. Of course, lack of experience is my main problem; nevertheless, I have to say: I'm stuck. This is my first Java project after reading Java: The Complete Reference. I'm doing it to get some real-world experience (and for fun). I want to write a parser for a Chinese "eBay".

Here is the problem and my two test cases: on the first page I need to get the dynamically loaded discount "129.00"; on the second I need "15.20".

As you can see, if you view these pages in a browser, at first you see the original price and, after a second or so, the discount.

Is it even possible to get these dynamic discounts from the HTML page? The other elements I need to parse are static. What should I try next: another library that renders HTML with JavaScript, or maybe something else? I really need some advice; I don't want to give up.

8/3/2013 1:35:14 PM

Accepted Answer

The DOM model returned after Worker.State.SUCCEEDED should already have been processed by JavaScript.

Your code worked for me when tested with FX 7u40 and 8.0 dev. I see the following output in the log:

<DIV id="J_PromoBox"><EM class="tb-promo-price-type">夏季新品</EM><EM class="tm-yen">¥</EM>    
<STRONG class="J_CurPrice">129.00</STRONG></DIV>

which is the dynamically loaded box with the data (129.00) you were looking for.

You may want to upgrade your JDK to 7u40 or revisit your log parsing algorithm.
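If the log parsing is the weak spot, one option is to query the DOM instead of searching the printed text. The following is only a minimal sketch using the standard javax.xml.xpath API: the class name PriceExtractor and the helper names extractPrice and parse are made up for illustration, and the demo runs on a trimmed copy of the fragment logged above, re-parsed as XML. Whether the same XPath expression matches the live Document returned by webengine.getDocument() (element-name case, namespaces) is an assumption you would need to check.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PriceExtractor {

    // Returns the trimmed text of the first STRONG element whose class is
    // "J_CurPrice", or an empty string if the document has no such element.
    public static String extractPrice(Document doc) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate("//STRONG[@class='J_CurPrice']", doc).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }

    // Demo helper (hypothetical): parses a well-formed markup fragment into a DOM.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed copy of the logged fragment (the EM elements are left out).
        String logged = "<DIV id=\"J_PromoBox\">"
                + "<STRONG class=\"J_CurPrice\">129.00</STRONG></DIV>";
        System.out.println(extractPrice(parse(logged)));
    }
}
```

An empty result then means the price box was not in the DOM at the time it was read, which is easier to tell apart from a text search that simply missed it.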

8/5/2013 9:11:56 AM

It sounds like you want the rendered DOM of a dynamic page after the JavaScript on the page has finished modifying the original HTML. This would not be easy to do in plain Java, as you would need to implement browser-like functionality with an embedded JavaScript engine. If you only care about reading a web page from Java, you might want to look into Selenium, since it takes control of a browser and lets you pull the rendered HTML into Java.

