How do you extract the data-template attribute value and content of specific div elements from an HTML string? - HTML Tutorial - PHP中文网

Efficiently extracting specific data from an HTML string

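This article shows how to extract data with a specific structure from an HTML string. Suppose the HTML contains several <div> elements, each with class="template_content" and a data-template attribute. The goal is to extract, for each of these <div> elements, the data-template attribute value and the element's content.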

For example, given the following HTML string:

<div class="template_content" data-template="template1">
  ...<div>Content 1aaa</div><div>Content 1bbb</div>...
</div>
<h3>Title 1</h3>
<div class="template_content" data-template="template2">
  <p>Content 2</p>
</div>
<h3>Title 2</h3>
<div class="template_content" data-template="template3">
  <p>Content 3</p>
</div>
<h3>Title 3</h3>
<div class="template_content" data-template="template4">
  <p>Content 4</p>
</div>

We want to extract data in the following format:

{ "data-template": "(extracted value 1)", "content": "(extracted value 2)" }

Here, "extracted value 1" is the value of the data-template attribute, and "extracted value 2" is the content inside the <div> tag.

A regular expression could be used, but a DOM parser handles HTML content more robustly. The following JavaScript code shows how to do this with DOMParser:

```javascript
// Sample HTML string containing several template blocks.
const html = `
<div class="template_content" data-template="template1">
  ...<div>Content 1aaa</div><div>Content 1bbb</div>...
</div>
<h3>Title 1</h3>
<div class="template_content" data-template="template2">
  <p>Content 2</p>
</div>
<h3>Title 2</h3>
<div class="template_content" data-template="template3">
  <p>Content 3</p>
</div>
<h3>Title 3</h3>
<div class="template_content" data-template="template4">
  <p>Content 4</p>
</div>
`;

// Parse the string into a DOM document (DOMParser is a browser API).
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');

// Select every <div> that has the class template_content.
const divs = doc.querySelectorAll('div.template_content');

// Collect one { "data-template", "content" } object per matched <div>.
const extractedData = [];
divs.forEach(div => {
  const template = div.getAttribute('data-template');
  const content = div.innerHTML;
  extractedData.push({ "data-template": template, "content": content });
});

console.log(extractedData);
```

The code first uses DOMParser to parse the HTML string into a DOM tree, then uses querySelectorAll to select all <div> elements with class="template_content". Finally, it iterates over the matched elements, reads each one's data-template attribute value and innerHTML, and stores them as objects in an array. This approach is more reliable than a regular expression because the parser tracks element nesting: the nested <div> elements inside template1, for example, are returned as part of that element's content rather than being mistaken for its end. The extraction therefore keeps working when the markup inside the templates changes.
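To make the comparison with regular expressions concrete, here is a minimal sketch of how a naive pattern can go wrong on nested markup. The pattern and the shortened `sample` string are assumptions made for illustration only; they do not come from the original article, and other regex attempts may fail in different ways.

```javascript
// Hypothetical minimal input: one template div containing two nested divs.
const sample =
  '<div class="template_content" data-template="template1">' +
  '<div>Content 1aaa</div><div>Content 1bbb</div>' +
  '</div>';

// Naive pattern (illustrative assumption): capture the attribute value,
// then lazily match everything up to the next closing </div>.
const naivePattern =
  /<div class="template_content" data-template="([^"]*)">([\s\S]*?)<\/div>/g;

// Build the same { "data-template", content } shape from the regex matches.
const naiveResults = Array.from(sample.matchAll(naivePattern), (m) => ({
  "data-template": m[1],
  content: m[2],
}));

console.log(naiveResults);
// The lazy match stops at the first inner </div>, so content is
// '<div>Content 1aaa' and the second nested div is lost.
```

Making the quantifier greedy does not fix this in general, since a greedy match can run past the end of one template block and swallow the ones that follow. A DOM parser avoids both problems because it matches opening and closing tags by structure rather than by position in the text.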