How to get the HTML of a fully loaded page (with JavaScript) as input in Java?


Question

I need to parse a page. Everything works, except that some elements on the page are loaded dynamically. I used jsoup for the static elements; when I realized that I really need the dynamic ones too, I tried JavaFX. I read a lot of answers on Stack Overflow, and many of them recommended the JavaFX WebEngine. So I ended up with this code.

@Override
public void start(Stage primaryStage) {
    WebView webview = new WebView();
    final WebEngine webengine = webview.getEngine();
    webengine.getLoadWorker().stateProperty().addListener(
            new ChangeListener<State>() {
                @Override
                public void changed(ObservableValue<? extends State> ov, State oldState, State newState) {
                    // SUCCEEDED fires once the page itself has finished loading
                    if (newState == State.SUCCEEDED) {
                        Document doc = webengine.getDocument();
                        // Serialize the DOM into a String
                        OutputFormat format = new OutputFormat(doc);
                        StringWriter stringOut = new StringWriter();
                        XMLSerializer serial = new XMLSerializer(stringOut, format);
                        try {
                            serial.serialize(doc);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        // Display the XML
                        System.out.println(stringOut.toString());
                    }
                }
            });
    webengine.load("http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658");
    primaryStage.setScene(new Scene(webview, 800, 800));
    primaryStage.show();
} 

I serialized the org.w3c.dom.Document to a string and printed it, but that did not help either. primaryStage.show() displayed the fully loaded page, with the element I need rendered on it, but that element was missing from the HTML in the output.

This is the third day I have been working on this issue. Lack of experience is of course my main problem, but I have to admit that I'm stuck. This is my first Java project after reading Java: The Complete Reference. I am doing it to get some real-world experience (and for fun). I want to write a parser for the Chinese "eBay".

Here is the problem and my test cases:

http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658 (I need the dynamically loaded discount price, "129.00")

http://item.taobao.com/item.htm?spm=a230r.1.14.67.MNq30d&id=22794120348 (I need "15.20")

As you can see, if you open these pages in a browser, you first see the original price and, after a second or so, the discounted one.

Is it even possible to get these dynamic discounts from the HTML page? The other elements I need to parse are static. What should I try next: another library that renders HTML with JavaScript, or maybe something else? I really need some advice and don't want to give up.

8/3/2013 1:35:14 PM

Accepted Answer

The DOM returned after Worker.State.SUCCEEDED should already have been processed by JavaScript.

Your code worked for me when I tested it with FX 7u40 and 8.0 dev. I see the following output in the log:

<DIV id="J_PromoBox"><EM class="tb-promo-price-type">夏季新品</EM><EM class="tm-yen">¥</EM>    
<STRONG class="J_CurPrice">129.00</STRONG></DIV>

This is the dynamically loaded box with the data (129.00) you were looking for.
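
As a side note on the serialization step: assuming the OutputFormat and XMLSerializer classes in the question come from Xerces or its JDK-internal copy, the same DOM-to-string conversion can also be written with the standard javax.xml.transform API. The sketch below is only an illustration; the class name DomSerializer and the helper parseXml are made up for this example, and the small hand-written document stands in for what webengine.getDocument() would return.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of JavaFX or of the question's code.
final class DomSerializer {

    /** Serializes any org.w3c.dom.Document to a String using the identity transformer. */
    static String toXmlString(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize the DOM", e);
        }
    }

    /** Demo-only helper: builds a Document from a string so the example runs without JavaFX. */
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the demo document", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for webengine.getDocument(); the markup mirrors the box shown above.
        Document doc = parseXml(
                "<DIV id=\"J_PromoBox\"><STRONG class=\"J_CurPrice\">129.00</STRONG></DIV>");
        System.out.println(toXmlString(doc));
    }
}
```

Inside the listener from the question, the call would then be DomSerializer.toXmlString(webengine.getDocument()).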

You may want to upgrade your JDK to 7u40 or revisit your log parsing algorithm.
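
If the problem is in how the printed output is searched, one option is to skip the string output entirely and query the org.w3c.dom.Document directly. The following is a minimal sketch under that assumption; PriceFinder, textOfFirstElementWithClass and parseXml are hypothetical names, and the class name J_CurPrice is taken from the output shown above. It uses only the standard DOM API, so the same method should work on the document returned by webengine.getDocument().

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, written only to illustrate the idea.
final class PriceFinder {

    /** Returns the trimmed text of the first element whose class attribute contains className. */
    static Optional<String> textOfFirstElementWithClass(Document doc, String className) {
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            // getAttribute returns "" when the attribute is absent
            for (String token : element.getAttribute("class").trim().split("\\s+")) {
                if (token.equals(className)) {
                    return Optional.of(element.getTextContent().trim());
                }
            }
        }
        return Optional.empty();
    }

    /** Demo-only helper: builds a Document from a string so the example runs without JavaFX. */
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the demo document", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for webengine.getDocument(); the markup mirrors the box shown above.
        Document doc = parseXml(
                "<DIV id=\"J_PromoBox\"><EM class=\"tb-promo-price-type\">promo</EM>"
                + "<STRONG class=\"J_CurPrice\">129.00</STRONG></DIV>");
        System.out.println(textOfFirstElementWithClass(doc, "J_CurPrice").orElse("not found"));
        // prints "129.00"
    }
}
```

An empty Optional right after SUCCEEDED would suggest that the element had not been added to the DOM yet at that moment, which is a different problem from parsing the output.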

8/5/2013 9:11:56 AM

It sounds like you want the rendered DOM of a dynamic page, after the JavaScript on the page has finished modifying the original HTML. This would not be easy to do in pure Java, as you would need to implement browser-like functionality with an embedded JavaScript engine. If you only care about reading a web page from Java, you might want to look into Selenium, since it takes control of a real browser and lets you pull the rendered HTML into Java.

This answer might also help:

Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?

