HTMLParserBuilder

A result builder for building HTML parsers and transforming HTML elements into strongly-typed results, inspired by RegexBuilder.

Note: CaptureTransform.swift and TypeConstruction.swift are copied from apple/swift-experimental-string-processing.

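Installation

Requirements

  • Swift 5.7 (buildPartialBlock)
  • macOS 10.15
  • iOS 13.0
  • tvOS 13.0
  • watchOS 6.0

Note: HTMLParserBuilder currently only supports platforms that support the Objective-C runtime, because it depends on HTMLKit, an Objective-C HTML parser library.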

dependencies: [
    // ...
    .package(name: "HTMLParserBuilder", url: "https://github.com/danny1113/html-parser-builder.git", from: "1.0.0")
]

Introduction

Parsing HTML can be complicated. For example, suppose you want to parse the simple HTML below:

<h1 id="hello">hello, world</h1>

<div id="group">
    <h1>INSIDE GROUP h1</h1>
    <h2>INSIDE GROUP h2</h2>
</div>

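Existing HTML parsing libraries have the following drawbacks:

  • You have to name every captured element
  • The code gets more complicated as the elements you want to capture become more complex
  • Error handling can be hard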

let htmlString = "<html>...</html>"
let doc = HTMLDocument(string: htmlString)
let first = doc.querySelector("#hello")?.textContent

let group = doc.querySelector("#group")
let second = group?.querySelector("h1")?.textContent
let third = group?.querySelector("h2")?.textContent

if  let first = first,
    let second = second,
    let third = third {
    
    // ...
} else {
    // ...
}

HTMLParserBuilder comes with some really great advantages:

  • Strongly-typed capture result
  • Structured syntax
  • Composable API
  • Support for async/await
  • Error handling built in

You can construct a parser that reflects the structure of the original HTML:

let capture = HTML {
    TryCapture("#hello") { (element: HTMLElement?) -> String? in
        return element?.textContent
    } // => HTML<String?>
    
    Local("#group") {
        Capture("h1", transform: \.textContent) // => HTML<String>
        Capture("h2", transform: \.textContent) // => HTML<String>
    } // => HTML<(String, String)>
    
} // => HTML<(String?, String, String)>


let htmlString = "<html>...</html>"
let doc = HTMLDocument(string: htmlString)

let output = try doc.parse(capture)
// => (String?, String, String)
// output: (Optional("hello, world"), "INSIDE GROUP h1", "INSIDE GROUP h2")

Note: You can currently compose up to 10 components inside the builder. If you need more, group your captures inside Local as a workaround.
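
The component limit most likely comes from how Swift 5.7 result builders accumulate tuple types: without variadic generics, each tuple arity needs its own buildPartialBlock overload. The following is a minimal, self-contained sketch of that pattern, not HTMLParserBuilder's actual source. MiniParser, MiniParserBuilder, and build are hypothetical names used only for illustration.

```swift
// Hypothetical miniature of the pattern; not HTMLParserBuilder's real code.
struct MiniParser<Output> {
    let run: (String) -> Output
}

@resultBuilder
enum MiniParserBuilder {
    // The first component starts the accumulation.
    static func buildPartialBlock<A>(first: MiniParser<A>) -> MiniParser<A> {
        first
    }

    // Arity 1 + 1 -> 2.
    static func buildPartialBlock<A, B>(
        accumulated: MiniParser<A>, next: MiniParser<B>
    ) -> MiniParser<(A, B)> {
        MiniParser { input in (accumulated.run(input), next.run(input)) }
    }

    // Arity 2 + 1 -> 3. Every additional arity needs another overload like this one.
    static func buildPartialBlock<A, B, C>(
        accumulated: MiniParser<(A, B)>, next: MiniParser<C>
    ) -> MiniParser<(A, B, C)> {
        MiniParser { input in
            let (a, b) = accumulated.run(input)
            return (a, b, next.run(input))
        }
    }
}

// Entry point that applies the builder to a closure of components.
func build<Output>(
    @MiniParserBuilder _ content: () -> MiniParser<Output>
) -> MiniParser<Output> {
    content()
}

let parser = build {
    MiniParser { $0.count }            // MiniParser<Int>
    MiniParser { $0.uppercased() }     // MiniParser<String>
    MiniParser { $0.hasPrefix("<") }   // MiniParser<Bool>
} // MiniParser<(Int, String, Bool)>

let result = parser.run("<h1>")
// result: (4, "<H1>", true)
```

In this sketch a fourth component would fail to compile as a flat tuple until a matching overload is added, which is why a fixed set of overloads translates into a fixed component limit.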

Detailed API Usage

Parsing

HTMLParserBuilder provides two functions for parsing:

public func parse<Output>(_ html: HTML<Output>) throws -> Output
public func parse<Output>(_ html: HTML<Output>) async throws -> Output

Note: You can choose the async version for better performance, since it uses structured concurrency to parallelize child tasks.

HTML

You can construct your parser inside HTML, which can also transform the output into another data type.

struct Group {
    let h1: String
    let h2: String
}

let capture = HTML {
    Capture("#group h1", transform: \.textContent) // => HTML<String>
    Capture("#group h2", transform: \.textContent) // => HTML<String>
    
} transform: { (output: (String, String)) -> Group in
    return Group(
        h1: output.0,
        h2: output.1
    )
} // => HTML<Group>

Capture

Using Capture is the same as using querySelector: you pass in a CSS selector to find the HTML element, and you can transform it to any other type you want:

  • innerHTML
  • textContent
  • attributes

Note: If Capture can’t find an HTML element that matches the selector, it throws an error that causes the whole parse to fail. For a failable capture, see TryCapture.

You can write this API in whichever form suits you best:

Capture("#hello", transform: \.textContent)
Capture("#hello") { $0.textContent }
Capture("#hello") { (e: HTMLElement) -> String in
    return e.textContent
}

TryCapture

TryCapture is a little different from Capture: it also calls querySelector to find the HTML element, but it returns an optional HTML element.

In this example, the result type is String?, and the result is nil when the HTML element can’t be found.

TryCapture("#hello") { (e: HTMLElement?) -> String? in
    return e?.innerHTML
}

CaptureAll

Using CaptureAll is the same as using querySelectorAll: you pass in a CSS selector to find all HTML elements that match the selector, and you can transform them to any other type you want:

You can write this API in whichever form suits you best:

CaptureAll("h1") { $0.map(\.textContent) }
CaptureAll("h1") { (e: [HTMLElement]) -> [String] in
    return e.map(\.textContent)
}

You can also capture other elements inside each match and transform them into another type:

<div class="group">
    <h1>Group 1</h1>
</div>
<div class="group">
    <h1>Group 2</h1>
</div>

CaptureAll("div.group") { (elements: [HTMLElement]) -> [String] in
    return elements.compactMap { e in
        return e.querySelector("h1")?.textContent
    }
}
// => [String]
// output: ["Group 1", "Group 2"]

Local

Local finds an HTML element that matches the selector, and all the captures inside it look up their elements relative to the element found by Local. This is useful when you only want to capture elements inside the local group.

Just like HTML, Local can also transform the captured result into another data type by adding transform:

struct Group {
    let h1: String
    let h2: String
}

Local("#group") {
    Capture("h1", transform: \.textContent) // => HTML<String>
    Capture("h2", transform: \.textContent) // => HTML<String>
} transform: { (output: (String, String)) -> Group in
    return Group(
        h1: output.0,
        h2: output.1
    )
} // => Group

Note: If Local can’t find an HTML element that matches the selector, it throws an error that causes the whole parse to fail. You can use TryCapture as an alternative.

LateInit

This library also comes with a handy property wrapper, LateInit, which delays initialization until the first time you access it.

struct Container {
    @LateInit var capture = HTML {
        Capture("h1", transform: \.textContent)
    }
}

// it needs to be `var` to perform late initialization
var container = Container()
let output = try doc.parse(container.capture)
// ...
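
The need for var follows from how such a wrapper has to work: the first read stores the freshly built value, so the getter mutates the wrapper. Below is a minimal sketch of a lazily initializing property wrapper under that assumption. LazyBox, buildCount, and makeExpensiveValue are hypothetical names, and the library's real LateInit may be implemented differently.

```swift
// Hypothetical sketch of a lazily initializing wrapper; LateInit itself may differ.
@propertyWrapper
struct LazyBox<Value> {
    private var storage: Value?
    private let makeValue: () -> Value

    // @autoclosure defers evaluating the initial value expression.
    init(wrappedValue: @autoclosure @escaping () -> Value) {
        self.makeValue = wrappedValue
    }

    var wrappedValue: Value {
        // The getter is mutating because the first access caches the value.
        mutating get {
            if let value = storage { return value }
            let value = makeValue()
            storage = value
            return value
        }
    }
}

// Counts how many times the value is actually built.
var buildCount = 0

func makeExpensiveValue() -> String {
    buildCount += 1
    return "built"
}

struct Container {
    @LazyBox var value = makeExpensiveValue()
}

// It has to be `var`: reading `value` mutates the wrapper on first access.
var container = Container()
let countBeforeAccess = buildCount   // nothing has been built yet
let first = container.value          // builds and caches the value
let second = container.value         // reuses the cached value
let countAfterAccess = buildCount
```

Declaring container with let would be rejected by the compiler in this sketch, because a mutating getter cannot be called on an immutable value.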

Wrap Up

API         Use Case
Capture     Throws an error when the element can’t be captured
TryCapture  Returns nil when the element can’t be captured
CaptureAll  Captures all elements that match the selector
Local       Captures elements in the local scope
LateInit    Delays initialization until the first time you access it

Advanced use case

  • Pass HTMLComponent into another
  • Transform to custom data structure before parsing

struct Group {
    let h1: String
    let h2: String
}

//       |--------------------------------------------------------------|
let groupCapture = HTML {                                            // |
    Local("#group") {                                                // |
        Capture("h1", transform: \.textContent) // => HTML<String>   // |
        Capture("h2", transform: \.textContent) // => HTML<String>   // |
    } // => HTML<(String, String)>                                   // |
                                                                     // |
} transform: { output -> Group in                                    // |
    return Group(                                                    // |
        h1: output.0,                                                // |
        h2: output.1                                                 // |
    )                                                                // |
} // => HTML<Group>                                                  // |
                                                                     // |
let capture = HTML {                                                 // |
    TryCapture("#hello") { (element: HTMLElement?) -> String? in     // |
        return element?.textContent                                  // |
    } // => HTML<String?>                                            // |
                                                                     // |
    groupCapture // => HTML<Group> -------------------------------------|
    
} // => HTML<(String?, Group)>


let htmlString = "<html>...</html>"
let doc = HTMLDocument(string: htmlString)

let output = try doc.parse(capture)
// => (String?, Group)
