jsoup: Java HTML Parser
Sect Leader, Level 20
#1 Posted: 2025-8-12 20:59

jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS selectors, and XPath selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.


Scrape and parse HTML from a URL, file, or string.
Find and extract data using DOM traversal or CSS selectors.
Manipulate HTML elements, attributes, and text.
Clean user-submitted content against a safelist to prevent XSS attacks.
Output tidy HTML.

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup. In every case, jsoup will create a sensible parse tree.

https://jsoup.org/
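
The parse-and-select workflow described above can be sketched as follows. This is only a minimal illustration, assuming jsoup is on the classpath; the class name JsoupSketch, the helper linkTargets, and the sample markup are made up for this example. It uses only Jsoup.parse, Document.title, Element.select, and Element.attr.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical example class, not part of jsoup.
class JsoupSketch {
    // Parses an HTML string and returns the href of every link inside a <p>.
    static String[] linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("p a[href]"); // CSS selector
        String[] hrefs = new String[links.size()];
        for (int i = 0; i < links.size(); i++) {
            hrefs[i] = links.get(i).attr("href");
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Deliberately sloppy sample markup: unclosed <p> tags, no <html> or <body>.
        String html = "<title>Demo</title><p>See <a href=\"https://jsoup.org/\">jsoup</a>"
                + "<p>and <a href=\"https://jsoup.org/cookbook/\">the cookbook</a>";
        System.out.println("Title: " + Jsoup.parse(html).title());
        for (String href : linkTargets(html)) {
            System.out.println(href);
        }
    }
}
```

Even though the sample input omits closing tags, the parser still builds a tree in which both links can be selected.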

Sect Leader, Level 20
#2 Posted: 2025-8-12 21:03

Which versions of Java are compatible with jsoup?
Java Version Requirements
jsoup is a popular Java library for parsing and manipulating HTML documents. Its Java version requirements have changed over time as the library was updated to take advantage of newer Java features.

 

Current jsoup Versions (1.15.0+)
Modern jsoup versions require Java 8 or higher:

- Minimum: Java 8 (Java 1.8), required

- Recommended: Java 11 or Java 17 (LTS versions)

- Supported: Java 8, 11, 17, 21, and newer versions

 

Legacy jsoup Versions
For older Java environments:

- jsoup 1.14.x and earlier: Compatible with Java 6+

- jsoup 1.13.x: Compatible with Java 6+

- jsoup 1.12.x and earlier: Compatible with Java 5+

https://webscraping.ai/faq/jsoup/which-versions-of-java-are-compatible-with-jsoup
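
Before choosing a jsoup release from the ranges quoted above, it may help to check which Java version the program is actually running on. The following is a rough sketch using only the JDK; the class JavaVersionCheck and the helper majorVersion are hypothetical names, and the only threshold it applies is the Java 8 minimum quoted for jsoup 1.15.0+.

```java
// Hypothetical helper, not part of jsoup.
class JavaVersionCheck {
    // Converts a "java.specification.version" value to a major version number.
    // Old runtimes report "1.6" or "1.8"; Java 9 and later report "9", "11", "17".
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        if (parts[0].equals("1") && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return Integer.parseInt(parts[0]);
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        int major = majorVersion(spec);
        System.out.println("Running on Java " + major);
        if (major >= 8) {
            System.out.println("Meets the Java 8 minimum quoted for jsoup 1.15.0+");
        } else {
            System.out.println("Below Java 8: an older jsoup release is needed");
        }
    }
}
```

The code avoids newer language features so that it also compiles on the old runtimes it is meant to detect.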

 
Sect Leader, Level 20
#3 Posted: 2025-8-12 21:12

jsoup 1.14.3

jsoup 1.14.3 is out now. It adds native XPath selector support and improved <template> support, along with a number of bug fixes, improvements, and performance enhancements.

See the release announcement for the full changelog.

https://repo1.maven.org/maven2/org/jsoup/jsoup/1.14.3/
https://repo1.maven.org/maven2/org/jsoup/jsoup/1.14.3/jsoup-1.14.3.jar
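
The native XPath support mentioned in this announcement is exposed through Element.selectXpath. A short sketch, assuming jsoup 1.14.3 or later is on the classpath; the class name XpathSelectSketch, the helper countMatches, and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical example class; requires jsoup 1.14.3 or later.
class XpathSelectSketch {
    // Returns how many elements in the parsed HTML match the XPath expression.
    static int countMatches(String html, String xpath) {
        Document doc = Jsoup.parse(html);
        return doc.selectXpath(xpath).size();
    }

    public static void main(String[] args) {
        // Sample markup made up for this sketch.
        String html = "<ul><li><a href=\"/one\">One</a></li>"
                + "<li><a href=\"/two\">Two</a></li></ul>";
        Elements links = Jsoup.parse(html).selectXpath("//li/a");
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

The expression selects elements; attribute values are then read through the usual jsoup Element methods.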

 
Sect Leader, Level 20
#4 Posted: 2025-8-12 21:32

In my testing, jsoup-1.10.3, released in June 2017, is the last version that supports JDK 1.6.

jsoup-1.10.3.jar 2017-06-11 19:15 355356

https://repo1.maven.org/maven2/org/jsoup/jsoup/1.10.3/jsoup-1.10.3.jar

 
巨大八爪鱼

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
 public static void main(String[] args) throws IOException {
  Document document = Jsoup.connect("http://cn.bing.com/").get();
  System.out.println("Title: " + document.title());
 }
}

  2025-8-12 21:32
巨大八爪鱼: This version of jsoup does not appear to support XPath; the javax.xml.xpath.XPath API bundled with Java 6 can be used instead.
  2025-8-12 22:42
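
As a rough illustration of the javax.xml.xpath alternative mentioned in the comment above, the sketch below evaluates an XPath expression against an in-memory XML string using only JDK classes. The class name XPathSketch, the helper evaluate, and the sample XML are hypothetical. Note that the JDK parser accepts only well-formed XML, so this does not handle tag-soup HTML the way jsoup does.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class using only the JDK.
class XPathSketch {
    // Parses an XML string and returns the text content of every matched node.
    static String[] evaluate(String xml, String expression) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            String[] result = new String[nodes.getLength()];
            for (int i = 0; i < nodes.getLength(); i++) {
                result[i] = nodes.item(i).getTextContent();
            }
            return result;
        } catch (Exception e) {
            // Wraps parser and XPath errors so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Sample XML made up for this sketch.
        String xml = "<links><a href=\"https://jsoup.org/\">jsoup</a>"
                + "<a href=\"https://example.com/\">example</a></links>";
        for (String href : evaluate(xml, "/links/a/@href")) {
            System.out.println(href);
        }
    }
}
```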
Sect Leader, Level 20
#5 Posted: 2025-8-12 21:36

Use DOM methods to navigate a document:

https://jsoup.org/cookbook/extracting-data/dom-navigation

 

Element a = document.getElementById("id_s"); // returns null when no element has this id
if (a != null) {
    System.out.println(a.html());
}

 

 
Sect Leader, Level 20
#6 Posted: 2025-8-14 19:44
import java.io.ByteArrayInputStream;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class Test3 {
    public static void main(String[] args) {
        String url = "https://zh.purasbar.com/images/scripts/?debug=https://zh.wikipedia.org/w/api.php?action=parse&format=xml&page=%E8%B0%83%E8%AF%95&variant=no&redirects";
        Document document = loadXMLDocument(url);
        if (document != null) {
            XPath xpath = XPathFactory.newInstance().newXPath();
            try {
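                // Select the <parse> element under the <api> root of the response.
                NodeList nodes = (NodeList)xpath.evaluate("/api/parse", document, XPathConstants.NODESET);
                Node node = nodes.item(0); // null when the XPath matches nothing
                String title = (node != null) ? getNodeAttribute(node, "title") : null;
                System.out.println(title);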
            } catch (XPathExpressionException e) {
                e.printStackTrace();
            }
        }
    }
    
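    // Returns the value of the named attribute, or null if the node has no such attribute.
    public static String getNodeAttribute(Node node, String name) {
        NamedNodeMap attributes = node.getAttributes();
        Node attribute = (attributes != null) ? attributes.getNamedItem(name) : null;
        return (attribute != null) ? attribute.getNodeValue() : null;
    }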
    
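    // Fetches the URL with jsoup and returns the response body as a string.
    public static String loadXMLString(String url) throws IOException {
        Connection conn = Jsoup.connect(url);
        // Disables TLS certificate checks. This method exists in jsoup 1.10.3 but was
        // deprecated and removed in later jsoup releases.
        conn.validateTLSCertificates(false);
        conn.ignoreContentType(true); // do not reject the response based on its Content-Type
        Response resp = conn.execute();
        return resp.body();
    }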
    
    public static Document loadXMLDocument(String url) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            String xmlstr = loadXMLString(url);
            byte[] xmlbytes = xmlstr.getBytes("UTF-8");
            ByteArrayInputStream xmlstream = new ByteArrayInputStream(xmlbytes);
            return builder.parse(xmlstream);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        }
        return null;
    }
}

Output:
 
Sect Leader, Level 20
#7 Posted: 2025-8-14 19:53
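// Assumes the variables from Test3, with the W3C document held in xmlDoc so that
// org.jsoup.nodes.Document can be used without a name clash.
nodes = (NodeList)xpath.evaluate("/api/parse/text", xmlDoc, XPathConstants.NODESET);
node = nodes.item(0);
// The <text> element carries the rendered page HTML as character data.
String html = node.getFirstChild().getNodeValue();
// Hand the embedded HTML to jsoup and inspect the first child of <body>.
org.jsoup.nodes.Document document = Jsoup.parse(html);
org.jsoup.nodes.Element body = document.body();
System.out.println(body.child(0).nodeName());
System.out.println(body.child(0).className());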

Output:
div
mw-content-ltr mw-parser-output
 
Sect Leader, Level 20
#8 Posted: 2025-8-14 22:14
Preserving Line Breaks When Using Jsoup:
https://www.baeldung.com/jsoup-line-breaks
 


Thread information

Views: 42 Replies: 9
Comments: ?
Author: 巨大八爪鱼
Last reply: 巨大八爪鱼
Last reply time: 2025-8-14 22:14
 
©2010-2025 Purasbar Ver2.0
Unless otherwise stated, this site is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.