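Web Scraping with Jsoup: A Quick Start

Introduction to Jsoup

Jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

jsoup is released under the MIT license, so it can be used in commercial projects.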

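What can you do with Jsoup?

Scrape and parse HTML from a URL, file, or string
Find and extract data using DOM traversal or CSS selectors
Manipulate HTML elements, attributes, and text
Clean user-submitted content against a safe whitelist to prevent XSS attacks
Output tidy HTML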

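Web scraping is not that hard: the approach is this simple

Get the URL of the data you want to scrape.
Use the methods in the Jsoup jar to parse the HTML into a Document.
Use the Document's get, first, children, and similar methods to pull out the data you want, such as image URLs, names, and timestamps.
Wrap the extracted data in your own entity class.
Render the entity data on the page.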
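Steps 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not a fixed recipe: the `Item` entity, the `ul.items > li` selector, and the `h3`/`img` markup are all assumptions made up for the example, and the HTML is passed in as a string so the sketch runs without a network connection (in a real crawler, step 1 would obtain the page with `Jsoup.connect(url).get()`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class CrawlSketch {
    // Hypothetical entity class holding the fields we want to keep.
    static class Item {
        final String title;
        final String imageUrl;

        Item(String title, String imageUrl) {
            this.title = title;
            this.imageUrl = imageUrl;
        }
    }

    // Parse the HTML into a Document, select the elements, and wrap them in entities.
    static List<Item> extract(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<Item> items = new ArrayList<>();
        // The selector below is an assumption about the page layout.
        for (Element li : doc.select("ul.items > li")) {
            String title = li.select("h3").text();
            Element img = li.selectFirst("img");
            // absUrl resolves a relative src against baseUri; missing images become "".
            String imageUrl = (img == null) ? "" : img.absUrl("src");
            items.add(new Item(title, imageUrl));
        }
        return items;
    }
}
```

Keeping the extraction in a method that takes the HTML as a parameter makes the selector logic easy to check against saved pages before it is pointed at a live site.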

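Main classes in Jsoup

The full library contains many classes, but in most cases the three below are the ones to focus on.

1. The org.jsoup.Jsoup class

The Jsoup class is the entry point of any Jsoup program and provides methods for loading and parsing HTML documents from various sources.

The org.jsoup.Jsoup class provides methods for connecting to, cleaning, and parsing HTML documents. The important methods are:

Method	Description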
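`static Connection connect(String url)`	Creates and returns a Connection to the URL.
`static Document parse(File in, String charsetName)`	Parses the contents of a file, read with the given charset, into a Document.
`static Document parse(File in, String charsetName, String baseUri)`	Parses the contents of a file, read with the given charset, into a Document, using `baseUri` to resolve relative links.
`static Document parse(String html)`	Parses the given `html` into a Document.
`static Document parse(String html, String baseUri)`	Parses the given html into a Document, using `baseUri` to resolve relative links.
`static Document parse(URL url, int timeoutMillis)`	Fetches the given URL and parses it into a Document.
`static String clean(String bodyHtml, Whitelist whitelist)`	Returns safe HTML from the input HTML by parsing it and filtering it through a whitelist of permitted tags and attributes.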
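The effect of the `baseUri` argument is easiest to see with a relative link. The sketch below is only an illustration: the class name `BaseUriDemo`, the sample markup, and the `https://example.com/` base are invented for the example. With a base URI, `abs:href` resolves to an absolute URL; without one, jsoup cannot resolve the link and returns an empty string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriDemo {
    // Returns the absolute URL of the first link, or "" if it cannot be resolved.
    // Pass null as baseUri to parse without a base.
    static String firstLink(String html, String baseUri) {
        Document doc = (baseUri == null) ? Jsoup.parse(html) : Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return (link == null) ? "" : link.attr("abs:href");
    }
}
```

For example, `BaseUriDemo.firstLink("<a href='/docs/index.html'>Docs</a>", "https://example.com/")` yields `https://example.com/docs/index.html`, while the same call with a `null` base yields an empty string.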

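2. The org.jsoup.nodes.Document class

This class represents an HTML document loaded through the Jsoup library. Use it for operations that apply to the whole HTML document.

For the important methods of the Document class, see http://jsoup.org/apidocs/org/jsoup/nodes/Document.html .

Finding elements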

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

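Element data

attr(String key) gets an attribute
attr(String key, String value) sets an attribute
attributes() gets all attributes
id(), className(), and classNames()
text() gets the text content
text(String value) sets the text content
html() gets the element's inner HTML
html(String value) sets the element's inner HTML
outerHtml() gets the element's outer HTML
data() gets the data content (for example, of script and style tags)
tag() and tagName()

Manipulating HTML and text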

append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
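A few of the methods listed above can be tried out on a small in-memory document. This is a minimal sketch; the class name `ElementDemo` and the sample markup are assumptions made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDemo {
    // Finds an element by id, then reads and changes its data.
    // Returns "className|lang|paragraph text|child count".
    static String demo() {
        Document doc = Jsoup.parse("<div id=\"box\" class=\"note\"><p>Hello</p></div>");
        Element box = doc.getElementById("box");
        Element p = box.child(0);                 // first child element: the <p>
        p.text("Hello, jsoup");                   // replace the text content
        p.attr("lang", "en");                     // set an attribute
        box.appendElement("span").text("added");  // append a new child element
        return box.className() + "|" + p.attr("lang") + "|" + p.text()
                + "|" + box.children().size();
    }
}
```

Calling `ElementDemo.demo()` returns `note|en|Hello, jsoup|2`: the class name read from the div, the attribute and text just set on the paragraph, and the child count after the append.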

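3. The org.jsoup.nodes.Element class

An HTML element consists of a tag name, attributes, and child nodes. With the Element class you can extract data, traverse nodes, and manipulate HTML.

For the important methods of the Element class, see http://jsoup.org/apidocs/org/jsoup/nodes/Element.html .

jsoup elements support a selector syntax similar to CSS (or jQuery), which makes for very powerful and flexible lookups.

The select method is available on Document, Element, and Elements objects. It is context-sensitive, so you can filter within a specific element or chain select calls.

select returns an Elements collection, which provides a set of methods for extracting and processing the results.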

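Selector overview

tagname: finds elements by tag, e.g. a
ns|tag: finds elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: finds elements by ID, e.g. #logo
.class: finds elements by class name, e.g. .masthead
[attribute]: finds elements that have the attribute, e.g. [href]
[^attr]: finds elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: finds elements by attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: finds elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: finds elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements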

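Combining selectors

el#id: element with ID, e.g. div#logo
el.class: element with class, e.g. div.masthead
el[attr]: element with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: finds elements that descend from an ancestor, e.g. .body p finds every p element anywhere under an element with class "body"
parent > child: finds direct children of a parent, e.g. div.content > p finds p elements, and body > * finds all direct children of the body tag
siblingA + siblingB: finds the sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X elements preceded by sibling A, e.g. h1 ~ p
el, el, el: groups several selectors and finds the unique elements that match any of them, e.g. div.masthead, div.logo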
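The combinators above can be checked against a small document. The following is a minimal sketch; the class name `SelectorDemo` and the sample HTML are invented for illustration, and `count` simply reports how many elements a selector matches.

```java
import org.jsoup.Jsoup;

class SelectorDemo {
    // Sample markup, made up for this example.
    static final String HTML =
          "<div id=\"logo\" class=\"masthead\">"
        +   "<a href=\"/home\" class=\"highlight\">Home</a><a>No href</a>"
        + "</div>"
        + "<div class=\"content\"><p>one</p><section><p>two</p></section></div>"
        + "<h1>Title</h1><p>after</p>";

    // Number of elements in the sample document matched by the CSS query.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }
}
```

With this markup, `count("div.content p")` is 2 because the descendant combinator also reaches the paragraph inside `section`, whereas `count("div.content > p")` is 1 because only one paragraph is a direct child. `count("h1 + p")` is 1: the paragraph that immediately follows the heading.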

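Pseudo selectors

:lt(n): finds elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) selects td elements at sibling index 0, 1, or 2
:gt(n): finds elements whose sibling index is greater than n, e.g. div p:gt(2) selects p elements inside a div whose sibling index is greater than 2
:eq(n): finds elements whose sibling index equals n, e.g. form input:eq(1) selects input elements inside a form whose sibling index is 1
:has(selector): finds elements that contain elements matching the selector, e.g. div:has(p) selects the divs that contain a p element
:not(selector): finds elements that do not match the selector, e.g. div:not(.logo) selects all divs that do not have class=logo
:contains(text): finds elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): finds elements that directly contain the given text
:matches(regex): finds elements whose text matches the given regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): finds elements whose own text matches the given regular expression

Note: the pseudo-selector indexes above are 0-based, so the first element has index 0, the second has index 1, and so on.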
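Because the index-based pseudo selectors are easy to misread, here is a small sketch that runs a few of them against a one-row table. The class name `PseudoDemo` and the markup are assumptions made up for the example; `texts` returns the text of each matched element.

```java
import org.jsoup.Jsoup;

import java.util.List;

class PseudoDemo {
    // Sample markup, made up for this example.
    static final String HTML =
          "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div><p>Login here</p></div>"
        + "<div class=\"logo\"><span>x</span></div>";

    // Text of every element in the sample document matched by the CSS query.
    static List<String> texts(String css) {
        return Jsoup.parse(HTML).select(css).eachText();
    }
}
```

Here `texts("td:lt(3)")` gives `[a, b, c]` (sibling indexes 0, 1, 2), `texts("td:eq(1)")` gives `[b]`, and `texts("td:gt(2)")` gives `[d]`. `texts("div:has(p)")`, `texts("div:not(.logo)")`, and `texts("p:contains(login)")` each give `[Login here]`; the last one matches despite the capital L because `:contains` ignores case.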

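Examples

Now let's look at some examples of working with HTML documents through the Jsoup API.

1. Load a document from a URL

To load a document from a URL, use the Jsoup.connect() method.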

Document document = Jsoup.connect("http://www.yiibai.com").get();
System.out.println(document.title());

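connect(String url) creates a new Connection, and get() fetches and parses the HTML. If an error occurs while fetching the URL, an IOException is thrown and should be handled appropriately.

The Connection interface also offers a chainable API for building more specific requests: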

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

This method only supports web URLs (the http and https protocols); to load from a file, use parse(File in, String charsetName) instead.

2. Load a document from a file

Use Jsoup.parse() to load HTML from a file.

Document document = Jsoup.parse( new File( "D:/temp/index.html" ) , "utf-8" );
System.out.println(document.title());

3. Load a document from a String

Use Jsoup.parse() to load HTML from a string.

String html = "<html><head><title>First parse</title></head>"
                    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document document = Jsoup.parse(html);
System.out.println(document.title());

4. Get the title

Get the title from a URL

Document doc = Jsoup.connect("http://www.yiibai.com").get();  
String title = doc.title();

Get the title from an HTML file

Document doc = Jsoup.parse(new File("e:\\register.html"),"utf-8");//assuming register.html file in e drive  
String title = doc.title();

5. Get the favicon of an HTML page

Assuming the favicon is the first matching image link in the <head> section of the HTML document, you can use the code below.

String favImage = "Not Found";
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
    if (element == null) {
        element = document.head().select("meta[itemprop=image]").first();
        if (element != null) {
            favImage = element.attr("content");
        }
    } 
    else {
        favImage = element.attr("href");
    }
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println(favImage);

6. Get all links in an HTML page

To get all links in a page, use the following code.

Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
Elements links = document.select("a[href]");  
for (Element link : links) {
     System.out.println("link : " + link.attr("href"));  
     System.out.println("text : " + link.text());  
}

7. Get all images in an HTML page

To get all images displayed in a page, use the following code.

Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
for (Element image : images) {
     System.out.println("src : " + image.attr("src"));
     System.out.println("height : " + image.attr("height"));
     System.out.println("width : " + image.attr("width"));
     System.out.println("alt : " + image.attr("alt"));
}

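8. Get the meta information of a URL

Meta information is what search engines such as Google use to determine a page's content for indexing. It lives in a few tags in the HEAD section of the HTML page. To get a page's meta information, use the code below.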

Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");

String description = document.select("meta[name=description]").get(0).attr("content");  
System.out.println("Meta description : " + description);  
    
String keywords = document.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);
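One caveat about the snippet above: when a page has no matching meta tag, `get(0)` throws an IndexOutOfBoundsException and `first()` returns null, so the chained `attr` call fails. A slightly more defensive variant, sketched below, calls `attr` on the Elements collection itself, which returns an empty string when nothing matches. The class name `MetaDemo` and the sample markup are assumptions made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class MetaDemo {
    // Returns the content of <meta name="..."> or "" when the tag is absent.
    static String meta(String html, String name) {
        Document doc = Jsoup.parse(html);
        // Elements.attr reads the attribute from the first match, or yields "" if there is none.
        return doc.select("meta[name=" + name + "]").attr("content");
    }
}
```

For a page whose head contains only `<meta name="description" content="A demo page">`, `meta(html, "description")` returns `A demo page` and `meta(html, "keywords")` returns an empty string instead of throwing.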

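9. Get form parameters from an HTML page

Getting form input elements from a page is straightforward: find the FORM element by its unique ID, then find all INPUT elements inside that form.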

Document doc = Jsoup.parse(new File("c:/temp/yiibai-index.html"),"utf-8");  
Element formElement = doc.getElementById("loginForm");  
Elements inputElements = formElement.getElementsByTag("input");  
for (Element inputElement : inputElements) {  
    String key = inputElement.attr("name");  
    String value = inputElement.attr("value");  
    System.out.println("Param name: "+key+" \nParam value: "+value);  
}
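Note that getElementById returns null when no element has the given ID, so the loop above would fail on a page without a loginForm. The sketch below adds that check and collects the parameters into a map. It is only an illustration: the class name `FormParams`, the choice to skip inputs without a name attribute, and the sample markup in the usage note are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

class FormParams {
    // Collects name/value pairs of the named inputs inside the form with the given id.
    // Returns an empty map when the form does not exist.
    static Map<String, String> read(String html, String formId) {
        Map<String, String> params = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        Element form = doc.getElementById(formId);
        if (form == null) {
            return params; // no such form on this page
        }
        // input[name] skips inputs without a name, such as a bare submit button.
        for (Element input : form.select("input[name]")) {
            params.put(input.attr("name"), input.val());
        }
        return params;
    }
}
```

Given `<form id="loginForm"><input name="user" value="alice"><input type="submit" value="Go"></form>`, `FormParams.read(html, "loginForm")` contains the single entry `user=alice`, and `FormParams.read(html, "missing")` is empty.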

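10. Update an element's attributes or content

Once you have located the elements you want with the methods above, you can use the Jsoup API to update their attributes or innerHTML. For example, to set rel="nofollow" on every link in the document: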

Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai.com.html"), "utf-8");
Elements links = document.select("a[href]");  
links.attr("rel", "nofollow");
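The same update can be tried on an in-memory string to see which elements are affected. This is a minimal sketch; the class name `NofollowDemo` and the sample markup in the usage note are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class NofollowDemo {
    // Sets rel="nofollow" on every <a> that has an href and returns the resulting body HTML.
    static String addNofollow(String html) {
        Document doc = Jsoup.parse(html);
        // Elements.attr(key, value) applies the change to every matched element.
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }
}
```

For the input `<a href="/a">A</a><a name="top">Top</a>`, only the first anchor gains the rel attribute, because the second one has no href and is not matched by `a[href]`.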

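11. Sanitize untrusted HTML (to prevent XSS)

Suppose your application displays HTML fragments submitted by users, for example HTML typed into a comment box. Displaying that HTML directly can cause very serious problems: a user can embed a malicious script that redirects other users to a malicious site.

To clean such HTML, Jsoup provides the Jsoup.clean() method. It takes a string of HTML and returns cleaned HTML, using a whitelist filter. The jsoup whitelist filter works by parsing the input HTML (in a safe, sandboxed environment) and then walking the parse tree, letting only known-safe tags and attributes (and values) through to the cleaned output. In jsoup 1.14.2 and later the class is named Safelist; Whitelist is the older name.

It does not use regular expressions, which are unsuitable for this task.

The cleaner is useful not only for avoiding XSS but also for limiting the range of elements users can supply: you might accept text and strong elements but not structural div or table elements.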

String dirtyHTML = "<p><a href='http://www.yiibai.com/' onclick='sendCookiesToMe()'>Link</a></p>";
String cleanHTML = Jsoup.clean(dirtyHTML, Whitelist.basic());
System.out.println(cleanHTML);

The output is:

<p><a href="http://www.yiibai.com/" rel="nofollow">Link</a></p>
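On newer jsoup releases the same cleaning can be written with `Safelist`, the renamed successor of `Whitelist`. The sketch below assumes jsoup 1.14.2 or later on the classpath; the class name `CleanDemo` is invented for illustration. It also shows `Jsoup.isValid`, which reports whether a fragment already conforms to the list without modifying it.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanDemo {
    // Removes every tag and attribute that Safelist.basic() does not allow.
    static String clean(String dirtyHtml) {
        return Jsoup.clean(dirtyHtml, Safelist.basic());
    }

    // True only if the fragment contains nothing that the cleaner would strip.
    static boolean isSafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }
}
```

A typical pattern is to call `clean` on everything before storing or rendering it, and to use `isSafe` only when you want to reject a submission rather than silently alter it.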

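12. List all links

This sample program shows how to fetch a page from a URL, extract all of its links, images, and other imports, and print their URLs and text.

The program requires a URL as its only argument.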

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);
        
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");
        
        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }
        print("\nImports: (%d)", imports.size());
        
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(),link.attr("abs:href"), link.attr("rel"));
        }
        print("\nLinks: (%d)", links.size());
        
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }
    
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }
    
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}

Sample output

Fetching http://news.ycombinator.com/...
Media: (38)
 * img: <http://ycombinator.com/images/y18.gif> 18x18 ()
 * img: <http://ycombinator.com/images/s.gif> 10x1 ()
 
Imports: (2)
 * link <http://ycombinator.com/news.css> (stylesheet)
 * link <http://ycombinator.com/favicon.ico> (shortcut icon)
 
Links: (141)
 * a: <http://ycombinator.com>  ()
 * a: <http://news.ycombinator.com/news>  (Hacker News)

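An SSL extension for Jsoup

Many sites now encrypt their traffic with SSL. When the JVM does not trust a site's certificate, a plain connection cannot fetch the page, and Jsoup itself does not handle this case beyond leaving a rather rough certificate-configuration route. I wondered whether Jsoup could be made to accept every SSL-encrypted page, and after some research found that the local HttpsURLConnection can be configured to accept any certificate. It works like this:

Replace the default HostnameVerifier of HttpsURLConnection so that it returns true for every host
Replace the default SSLSocketFactory of HttpsURLConnection with one whose trust manager accepts any certificate

Implementation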

package org.hanmeis.common.html;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.util.LinkedList;
import java.util.List;

/**
 * Created by zhao.wu on 2016/12/2.
 * This spider crawls the fantasy-novel list on Qishu (奇书网).
 */
public class QiShuuListSpider {
    // List that holds the collected novel records
    static List<NovelDir> novelDirs = new LinkedList<>();
    
    public static void main(String[] args) throws IOException {
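        // Crawl and parse, starting from the index page
        URL index = new URL("http://www.qisuu.com/soft/sort02/");
        parsePage(index);
        // Save the records, one per line; try-with-resources closes the file even on error
        try (FileWriter writer = new FileWriter("qishu.txt")) {
            for (NovelDir novelDir : novelDirs) {
                writer.write(novelDir.toString());
                writer.write(System.lineSeparator());
            }
        }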
    }
    
    static void parsePage(URL url){
        try {
            // Fetch the page and parse it into a DOM with the SSL-enabled Jsoups helper
            Document doc = Jsoups.parse(url, 1000);
            
            // Select the list of novels
            Elements novelList = doc.select(".listBox li");
            
            for (Element element : novelList) {
                NovelDir dir = new NovelDir();
                
                // Author of the novel
                Element authorElement = element.select(".s a").first();
                if(authorElement!=null) {
                    dir.setAuthor(authorElement.html());
                }
                
                // Description of the novel
                Element descriElement = element.select(".u").first();
                if(descriElement!=null) {
                    dir.setDescription(descriElement.html());
                }
                
                // Title, index URL, and cover image
                Element titleElement = element.select("a").last();
                if(titleElement!=null) {
                    dir.setTitle(titleElement.html());
                    dir.setIndexUrl(titleElement.attr("abs:href"));
                    Element imageElement = titleElement.select("img").first();
                    if(imageElement!=null) {
                        dir.setHeadPic(imageElement.attr("src"));
                    }
                }
                
                System.out.println(dir);
                novelDirs.add(dir);
            }
            
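            // Find the "next page" link and request it
            Elements pageDiv = doc.select(".tspage a");
            for (Element element : pageDiv) {
                // "下一页" is the site's link text for "next page"
                if(element.html().equals("下一页")){
                    // "abs:href" yields the absolute URL of that page
                    String path = element.attr("abs:href");
                    
                    // The site rate-limits requests and temporarily blocks clients that go too fast, so pause between requests
                    Thread.sleep(2000);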
                    parsePage(new URL(path));
                }
            }
        } catch (IOException e) {
            System.out.println(url);
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    
    /**
     * Metadata record for one novel
     */
    static class NovelDir{
        // Cover image
        private String headPic;
        // Author
        private String author;
        // Title
        private String title;
        // Index (table of contents) URL
        private String indexUrl;
        // Short description
        private String description;
        // Getters, setters, and toString() omitted
    }
}

SSL extension code

package org.hanmeis.common.html;

import org.jsoup.Connection;
import org.jsoup.helper.HttpConnection;
import org.jsoup.nodes.Document;
import javax.net.ssl.*;
import java.io.IOException;
import java.net.URL;
import java.security.SecureRandom;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

/**
 * Created by zhao.wu on 2016/11/29.
 */
public class Jsoups{
    static{
        try {
            // Replace the default HostnameVerifier of HttpsURLConnection so that it returns true for every host
            HttpsURLConnection.setDefaultHostnameVerifier(new HostnameVerifier() {
                public boolean verify(String hostname, SSLSession session) {
                    return true;
                }
            });
            
            // Create an SSLContext whose trust manager accepts any certificate without checking it
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new X509TrustManager[] { new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                }
                public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                }
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            } }, new SecureRandom());
            
            // Replace the default SSLSocketFactory of HttpsURLConnection with the trust-all factory
            HttpsURLConnection.setDefaultSSLSocketFactory(context.getSocketFactory());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    
    /**
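     * Fetches the DOM of a remote HTML page over SSL.
     * Functionally this is the same as Jsoup's own parse method;
     * it exists only to signal to readers that SSL handling has been extended.
     * @param url address of the page to parse
     * @param timeoutMillis request timeout in milliseconds
     * @return the DOM tree of the page
     * @throws IOException if the request or the parsing fails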
     */
    public static Document parse(URL url, int timeoutMillis) throws IOException {
        Connection con = HttpConnection.connect(url);
        con.timeout(timeoutMillis);
        return con.get();
    }   
}
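The static block above changes the defaults for every HTTPS connection in the JVM, and accepting any certificate removes the protection against man-in-the-middle attacks. A narrower option, sketched below under the assumption that your jsoup version offers `Connection.sslSocketFactory(SSLSocketFactory)`, is to build the trust-all factory on its own and attach it only to the requests that need it. The class name `TrustAllSockets` is hypothetical, and the block itself uses only the JDK.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

class TrustAllSockets {
    // Builds a socket factory whose trust manager accepts every certificate.
    // It performs NO validation, so use it only for pages you would fetch anyway.
    static SSLSocketFactory create() {
        TrustManager[] trustAll = { new X509TrustManager() {
            public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            public void checkServerTrusted(X509Certificate[] chain, String authType) { }
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        } };
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, trustAll, new SecureRandom());
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            // Wrapped as unchecked so callers do not need a throws clause.
            throw new IllegalStateException("Could not create the trust-all socket factory", e);
        }
    }
}
```

A single request could then be written as `Jsoup.connect(url).sslSocketFactory(TrustAllSockets.create()).get()`, leaving the JVM-wide HTTPS defaults, and every other connection in the application, untouched.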

References

Official site: https://jsoup.org

Official online tester: https://try.jsoup.org

Official cookbook: https://jsoup.org/cookbook/ (Chinese version: http://www.open-open.com/jsoup/)

A simple crawler built on Jsoup: https://blog.csdn.net/WuZuoDingFeng/article/details/53539402

Please do not repost without permission: 程序喵 » Web Scraping with Jsoup: A Quick Start