The author has made 10 posts in total.

jsoup: Java HTML Parser

#1 巨大八爪鱼 2025-8-12 20:59

jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS selectors, and XPath selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.


- Scrape and parse HTML from a URL, file, or string.
- Find and extract data using DOM traversal or CSS selectors.
- Manipulate HTML elements, attributes, and text.
- Clean user-submitted content against a safelist to prevent XSS attacks.
- Output tidy HTML.

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup; in every case jsoup will create a sensible parse tree.

https://jsoup.org/

#2 巨大八爪鱼 2025-8-12 21:03

Which versions of Java are compatible with jsoup?
Java Version Requirements
jsoup is a popular Java library for parsing and manipulating HTML documents. The Java version requirements have evolved over time as jsoup has been updated to take advantage of newer Java features.

Current jsoup Versions (1.15.0+)
Modern jsoup versions require Java 8 or higher:

- Minimum (required): Java 8 (Java 1.8)

- Recommended: Java 11 or Java 17 (LTS versions)

- Supported: Java 8, 11, 17, 21, and newer versions

Legacy jsoup Versions
For older Java environments:

- jsoup 1.14.x and earlier: Compatible with Java 6+

- jsoup 1.13.x: Compatible with Java 6+

- jsoup 1.12.x and earlier: Compatible with Java 5+

https://webscraping.ai/faq/jsoup/which-versions-of-java-are-compatible-with-jsoup
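
Since the right jsoup release depends on the Java runtime, it can help to check which Java version is actually running before picking a jar. A minimal sketch, assuming the standard `java.specification.version` system property (which reads `1.6` or `1.8` on older runtimes and `11` or `17` on newer ones); the class and method names are made up for illustration:

```java
class JavaVersionCheck {
    // Converts a specification version such as "1.6", "1.8", "11" or "17"
    // into its major number (6, 8, 11, 17).
    static int majorVersion(String specVersion) {
        String v = specVersion.trim();
        if (v.startsWith("1.")) {
            v = v.substring(2); // legacy "1.x" scheme used up to Java 8
        }
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot); // drop any minor part
        }
        return Integer.parseInt(v);
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Running on Java " + majorVersion(spec));
    }
}
```

The sketch only reports the runtime version; which jsoup release matches it still has to be checked against the jsoup release notes or by testing.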

#3 巨大八爪鱼 2025-8-12 21:12

jsoup 1.14.3

jsoup 1.14.3 is out now, adding native XPath selector support, improved <template> support, and also includes a bunch of bug fixes, improvements, and performance enhancements.

See the release announcement for the full changelog.

https://repo1.maven.org/maven2/org/jsoup/jsoup/1.14.3/
https://repo1.maven.org/maven2/org/jsoup/jsoup/1.14.3/jsoup-1.14.3.jar

#4 巨大八爪鱼 2025-8-12 21:32

In my testing, jsoup-1.10.3, released in June 2017, is the last version that supports JDK 1.6.

jsoup-1.10.3.jar 2017-06-11 19:15 355356

https://repo1.maven.org/maven2/org/jsoup/jsoup/1.10.3/jsoup-1.10.3.jar

巨大八爪鱼

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) throws IOException {
        // Fetch the page over HTTP and parse it into a jsoup Document.
        Document document = Jsoup.connect("http://cn.bing.com/").get();
        // Print the contents of the page's <title> element.
        System.out.println("Title: " + document.title());
    }
}

巨大八爪鱼: This version of jsoup (1.10.3) does not seem to support XPath; the javax.xml.xpath.XPath API that ships with Java 6 can be used instead.
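
A minimal sketch of that substitution, using only the JDK's own XML and XPath classes. It assumes the input is well-formed XML (real-world HTML often is not, so it would need to be converted first); the sample string and the names `XPathSketch` and `selectValues` are made up for illustration:

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Hypothetical sample input; any well-formed XML would do.
    static final String SAMPLE =
        "<links>"
        + "<a href=\"https://jsoup.org/\">jsoup</a>"
        + "<a href=\"https://example.com/\">example</a>"
        + "</links>";

    // Returns the text content of every node matched by the XPath expression.
    static String[] selectValues(String xml, String expression) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            String[] values = new String[nodes.getLength()];
            for (int i = 0; i < values.length; i++) {
                values[i] = nodes.item(i).getTextContent();
            }
            return values;
        } catch (Exception e) {
            // Parser, I/O and XPath errors are wrapped to keep the sketch short.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Print the href attribute of every <a> element.
        for (String href : selectValues(SAMPLE, "/links/a/@href")) {
            System.out.println(href);
        }
    }
}
```

The code avoids the diamond operator, lambdas and try-with-resources, so it should also compile on an old JDK such as 1.6.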

#5 巨大八爪鱼 2025-8-12 21:36

Use DOM methods to navigate a document:

https://jsoup.org/cookbook/extracting-data/dom-navigation

Element a = document.getElementById("id_s"); // null if no element has this id
if (a != null) {
    System.out.println(a.html()); // inner HTML of the matched element
}

#6 巨大八爪鱼 2025-8-14 19:44

import java.io.ByteArrayInputStream;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class Test3 {
    public static void main(String[] args) {
        String url = "https://zh.purasbar.com/images/scripts/?debug=https://zh.wikipedia.org/w/api.php?action=parse&format=xml&page=%E8%B0%83%E8%AF%95&variant=no&redirects";
        Document document = loadXMLDocument(url);
        if (document != null) {
            XPath xpath = XPathFactory.newInstance().newXPath();
            try {
                NodeList nodes = (NodeList)xpath.evaluate("/api/parse", document, XPathConstants.NODESET);
                Node node = nodes.item(0);
                String title = getNodeAttribute(node, "title");
                System.out.println(title);
            } catch (XPathExpressionException e) {
                e.printStackTrace();
            }
        }
    }
    
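    // Returns the value of the named attribute, or null if the node or attribute is missing.
    public static String getNodeAttribute(Node node, String name) {
        if (node == null || node.getAttributes() == null) {
            return null;
        }
        Node attribute = node.getAttributes().getNamedItem(name);
        return attribute != null ? attribute.getNodeValue() : null;
    }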
    
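    // Downloads the response body as a string using jsoup's HTTP client.
    public static String loadXMLString(String url) throws IOException {
        Connection conn = Jsoup.connect(url);
        // Skips TLS certificate checks (insecure, for testing only); this method
        // exists in jsoup 1.10.x but was deprecated and removed in later releases.
        conn.validateTLSCertificates(false);
        // Do not fail on content types that jsoup would otherwise reject.
        conn.ignoreContentType(true);
        Response resp = conn.execute();
        return resp.body();
    }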
    
    public static Document loadXMLDocument(String url) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            String xmlstr = loadXMLString(url);
            byte[] xmlbytes = xmlstr.getBytes("UTF-8");
            ByteArrayInputStream xmlstream = new ByteArrayInputStream(xmlbytes);
            return builder.parse(xmlstream);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        }
        return null;
    }
}

Output:

#7 巨大八爪鱼 2025-8-14 19:53
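// Continuing from #6: nodes and node are reused, and the W3C document is named xmlDoc here.
nodes = (NodeList)xpath.evaluate("/api/parse/text", xmlDoc, XPathConstants.NODESET);
node = nodes.item(0);
// The <text> element holds the rendered page HTML as a text node.
String html = node.getFirstChild().getNodeValue();
// Parse that HTML with jsoup; the fully qualified names avoid clashing with org.w3c.dom.Document.
org.jsoup.nodes.Document document = Jsoup.parse(html);
org.jsoup.nodes.Element body = document.body();
// Print the tag name and class attribute of the first element inside <body>.
System.out.println(body.child(0).nodeName());
System.out.println(body.child(0).className());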

Output:
div
mw-content-ltr mw-parser-output
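
The two lookups from #6 and #7 can also be written as string-valued XPath queries, which skips the NodeList handling and returns an empty string instead of failing when nothing matches. A minimal sketch using only the JDK; the XML below is a hand-written stand-in for the API response (not real server output), and the names `ApiXmlSketch` and `stringValue` are made up for illustration:

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

class ApiXmlSketch {
    // Hand-written stand-in for the API response; not real server output.
    static final String SAMPLE =
        "<api><parse title=\"Sample page\">"
        + "<text>&lt;div class=\"mw-parser-output\"&gt;Hello&lt;/div&gt;</text>"
        + "</parse></api>";

    // Evaluates the expression against the XML and returns its string value
    // (an empty string if the expression matches nothing).
    static String stringValue(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Attribute lookup, as in #6.
        System.out.println(stringValue(SAMPLE, "/api/parse/@title"));
        // Embedded HTML, as in #7; the entities are decoded by the XML parser.
        System.out.println(stringValue(SAMPLE, "/api/parse/text"));
    }
}
```

The HTML string returned by the second query could then be handed to Jsoup.parse exactly as in the snippet above.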

#8 巨大八爪鱼 2025-8-14 22:14
Preserving Line Breaks When Using Jsoup:
https://www.baeldung.com/jsoup-line-breaks

©2010-2025 Purasbar Ver3.0
Unless otherwise stated, this site is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.