Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

java - How to parse HTML table using jsoup?

I am trying to parse HTML using jsoup. This is my first time working with jsoup, and I have read some tutorials on it. Below is the HTML table I am trying to parse.

The table below has three tr elements for now (I have shortened it to three rows to keep the example readable; in general there will be more). I would like to extract the cluster name from the table along with its corresponding host names. For example, I would extract Titan as the cluster name, together with every host name in that cluster whose status is down.

As you can see below, the Titan cluster has two host names, machineA.abc.com and machineB.abc.com; machineA is up but machineB is down.

So I would print Titan as the cluster name and machineB.abc.com as the host name, since it is down. Is this possible with jsoup?

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
</table>

So far, I am able to load the whole HTML table using jsoup, but I am not sure how to extract the cluster name and the host names that are down:

URL url = new URL("url_name");
Document doc = Jsoup.parse(url, 3000);

Update:

The table might contain two cluster names, as shown below:

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Goldy</td>
      <td>10.100.111.77</td>
      <td>machineH.pqr.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>       
</table>

As you can see above, there are now two cluster names, Titan and Goldy. I want to find all the machines that are down for the Titan cluster only.


1 Reply

Yes, it is possible with jsoup. First, select the table. Then select the <tr> tags for the rows. You can start from index 1, since the first row contains only the column names. For each row, select its <td> cells and read the ones you need by index. In your case, indexes 7 and 5 matter (index 7: Status, index 5: Host Name). Check whether the status equals down and, if it does, add the host name to a list. That's all.

ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");

for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
    Element row = rows.get(i);
    Elements cols = row.select("td");

    if (cols.get(7).text().equals("down")) {
        downServers.add(cols.get(5).text());
    }
}
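
If the column order might change, hard-coding 7 and 5 is fragile. As a rough sketch (not part of the original answer), the indexes can be looked up from the header row by name. The snippet assumes each row has already been turned into a list of cell texts, for example by calling text() on every <td> of a row. The HeaderLookup class and its method names are made up for illustration, and plain lists stand in for the jsoup elements so that it runs without the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names and structure are illustrative assumptions.
final class HeaderLookup {

    /** Returns the position of the header cell matching name, or -1 if absent. */
    static int columnIndex(List<String> header, String name) {
        for (int i = 0; i < header.size(); i++) {
            if (header.get(i).trim().equalsIgnoreCase(name)) {
                return i;
            }
        }
        return -1;
    }

    /** Collects the host name of every data row whose status cell reads "down". */
    static List<String> downHosts(List<List<String>> rows) {
        List<String> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        // The first row is assumed to hold the column names.
        int hostCol = columnIndex(rows.get(0), "Host Name");
        int statusCol = columnIndex(rows.get(0), "Status");
        if (hostCol < 0 || statusCol < 0) {
            return result; // expected columns not found
        }
        for (List<String> row : rows.subList(1, rows.size())) {
            // Skip rows that are too short to contain both cells.
            if (row.size() <= Math.max(hostCol, statusCol)) {
                continue;
            }
            if ("down".equalsIgnoreCase(row.get(statusCol).trim())) {
                result.add(row.get(hostCol).trim());
            }
        }
        return result;
    }
}
```

With jsoup, the same lookup could be applied to the texts of the first row's <td> cells before looping over the remaining rows.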

Update: To restrict the result to the Titan cluster, find the row whose cluster name is Titan, then use an inner loop to walk the rows that follow for as long as their cluster name cell is empty, since those rows belong to the same cluster.

    ArrayList<String> downServers = new ArrayList<>();
    Element table = doc.select("table").get(0); // select the first table
    Elements rows = table.select("tr");

    for (int i = 1; i < rows.size(); i++) { // first row holds the column names, so skip it
        Elements cols = rows.get(i).select("td");

        if (cols.get(3).text().equals("Titan")) {
            if (cols.get(7).text().equals("down")) {
                downServers.add(cols.get(5).text());
            }

            // Consume the following rows with an empty cluster name; they belong to Titan.
            // Peeking at i + 1 before advancing keeps the index in bounds and lets the
            // outer loop resume at the next named cluster (or at another Titan row).
            while (i + 1 < rows.size()) {
                Elements next = rows.get(i + 1).select("td");
                if (!next.get(3).text().isEmpty()) {
                    break;
                }
                if (next.get(7).text().equals("down")) {
                    downServers.add(next.get(5).text());
                }
                i++;
            }
        }
    }

The downServers list will then contain the host names of the servers that are down.
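
An alternative to the inner loop, sketched here under the same column assumptions (index 3: Cluster Name, index 5: Host Name, index 7: Status), is a single pass that remembers the last non-empty cluster name and treats rows with an empty name as belonging to it. ClusterFilter and downHosts are hypothetical names, and each row is assumed to be a list of cell texts with the header row already removed, so the snippet runs without jsoup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a single-pass variant, not the original answer's code.
final class ClusterFilter {
    private static final int CLUSTER = 3;
    private static final int HOST = 5;
    private static final int STATUS = 7;

    /** Returns the hosts marked "down" that belong to the given cluster. */
    static List<String> downHosts(List<List<String>> dataRows, String cluster) {
        List<String> result = new ArrayList<>();
        String current = ""; // last cluster name seen so far
        for (List<String> row : dataRows) {
            if (row.size() <= STATUS) {
                continue; // row too short to hold the cells we need
            }
            String name = row.get(CLUSTER).trim();
            if (!name.isEmpty()) {
                current = name; // a named row starts a new cluster
            }
            if (current.equals(cluster) && row.get(STATUS).trim().equals("down")) {
                result.add(row.get(HOST).trim());
            }
        }
        return result;
    }
}
```

Because the index never moves backwards or skips ahead, this version needs no special handling for the last row or for two named clusters in a row.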

