php – Speeding up XML schema validation of a batch of XML files against the same XML schema (XSD)

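I would like to speed up the process of validating a batch of XML files against one and the same XML schema (XSD). My only restriction is that I am in a PHP environment.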

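My current problem is that the schema I want to validate against imports the fairly complex XHTML schema of 2755 lines (http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd).
Even for very simple data this takes a long time (around 30 seconds per validation).
As I have thousands of XML files in my batch, this does not scale well.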

To validate the XML files I use both of these methods from the standard php-xml library:

> DOMDocument::schemaValidate
> DOMDocument::schemaValidateSource

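I suspect that the PHP implementation fetches the XHTML schema via HTTP, builds some internal representation (possibly a DOMDocument), and throws it away once the validation is complete. I was thinking that some option of the XML libraries might change this behaviour so that something is cached along the way for reuse.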

I have built a simple test setup that illustrates my problem:

test-schema.xsd

<xs:schema attributeFormDefault="unqualified"
    elementFormDefault="qualified"
    targetNamespace="http://myschema.example.com/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myschema="http://myschema.example.com/"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xs:import
        schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
        namespace="http://www.w3.org/1999/xhtml">
    </xs:import>
    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="MyHTMLElement">
                    <xs:complexType>
                        <xs:complexContent>
                            <xs:extension base="xhtml:Flow"></xs:extension>
                        </xs:complexContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

test-data.xml

<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://myschema.example.com/" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://myschema.example.com/ test-schema.xsd ">
  <MyHTMLElement>
    <xhtml:p>This is an XHTML paragraph!</xhtml:p>
  </MyHTMLElement>
</Root>

schematest.php

<?php
$data_dom = new DOMDocument();
$data_dom->load('test-data.xml');

// Multiple validations using the schemaValidate method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidate: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidate('test-schema.xsd')) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

// Loading schema into a string.
$schema_source = file_get_contents('test-schema.xsd');

// Multiple validations using the schemaValidateSource method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidateSource: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidateSource($schema_source)) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

Running this schematest.php file produces the following output:

schemaValidate: Attempt #1 returns Valid! in 30 seconds.
schemaValidate: Attempt #2 returns Valid! in 30 seconds.
schemaValidate: Attempt #3 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 32 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 30 seconds.

Any help and suggestions on how to solve this are very welcome!

Best answer
You can safely subtract 30 seconds from the timing values as overhead.

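Remote requests to the W3C servers are being delayed because most libraries do not cache the documents (even though the HTTP headers suggest doing so). But read for yourself: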

The W3C servers are slow to return DTDs. Is the delay intentional?

Yes. Due to various software systems downloading DTDs from our site millions of times a day (despite the caching directives of our servers), we have started to serve DTDs and schema (DTD, XSD, ENT, MOD, etc.) from our site with an artificial delay. Our goals in doing so are to bring more attention to our ongoing issues with excessive DTD traffic, and to protect the stability and response time of the rest of our site. We recommend HTTP caching or catalog files to improve performance.

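W3.org tries to keep the number of requests low, which is understandable. PHP's DOMDocument is based on libxml, and libxml allows setting an external entity loader. The whole Catalog support section of the libxml documentation is of interest in this case.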

To solve the problem at hand, set up a catalog.xml file:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
            uri="xhtml1-transitional.xsd"/>
    <system systemId="http://www.w3.org/2001/xml.xsd"
            uri="xml.xsd"/>
</catalog>

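Save a copy of the two .xsd files next to the catalog file, under the names given in that catalog file (relative paths as well as absolute file:/// … paths work if you prefer a different directory).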
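
Obtaining those copies can be scripted so that each schema is requested only once. The following is a minimal sketch, not part of the original setup: fetchSchemaOnce() is a hypothetical helper name, and reading an http:// URL with file_get_contents() assumes that allow_url_fopen is enabled.

```php
<?php
/**
 * Store a local copy of a schema unless one already exists.
 * fetchSchemaOnce() is a hypothetical helper, not part of PHP or libxml.
 *
 * @param string $source    URL or path of the schema to copy
 * @param string $targetDir existing directory that receives the copy
 * @return string path of the local copy
 */
function fetchSchemaOnce(string $source, string $targetDir): string
{
    // Name the local file after the last path segment of the source.
    $name = basename((string) parse_url($source, PHP_URL_PATH));
    $target = rtrim($targetDir, '/') . '/' . $name;

    if (is_file($target)) {
        return $target; // already stored, no further request needed
    }

    $content = file_get_contents($source);
    if ($content === false) {
        throw new RuntimeException("Could not read $source");
    }
    if (file_put_contents($target, $content) === false) {
        throw new RuntimeException("Could not write $target");
    }

    return $target;
}
```

Calling it once for each of the two systemId URLs from catalog.xml, with the directory of the catalog file as $targetDir, would produce the files the catalog refers to. The first call for each URL is still subject to the W3C delay.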

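Then make sure that the system environment variable XML_CATALOG_FILES is set to the file name of the catalog.xml file. Once everything is set up, the validation just runs through: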

schemaValidate: Attempt #1 returns Valid! in 0 seconds.
schemaValidate: Attempt #2 returns Valid! in 0 seconds.
schemaValidate: Attempt #3 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 0 seconds.

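If it still takes a long time, that is just a sign that the environment variable does not point to the right location. I have covered setting the variable, along with some edge cases, in a blog post: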

> Using Catalogs for Validation with PHP’s DOMDocument and Libxml2.

It should handle various edge cases, such as file names containing spaces.
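
As a rough illustration only (this is not the code from the blog post), the variable could be set from within the script itself. The sketch assumes that putenv() runs before libxml initializes its catalogs in the current process and that the path is an absolute POSIX-style path. catalogUri() is a hypothetical helper name. Spaces are percent-encoded because XML_CATALOG_FILES is a space-separated list.

```php
<?php
/**
 * Turn an absolute POSIX path into a file:// URI with encoded spaces.
 * catalogUri() is a hypothetical helper, not part of PHP or libxml.
 */
function catalogUri(string $absolutePath): string
{
    // A literal space would split the entry in two, so encode it.
    return 'file://' . str_replace(' ', '%20', $absolutePath);
}

// Assumption: catalog.xml lives next to this script.
putenv('XML_CATALOG_FILES=' . catalogUri(__DIR__ . '/catalog.xml'));
```

If validation is still slow afterwards, the variable was probably set too late or points to the wrong file.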

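Alternatively, you can create a simple external entity loader callback function that uses a URL => file mapping to the local file system in the form of an array: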

$mapping = [
     'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd'
         => 'schema/xhtml1-transitional.xsd',

     'http://www.w3.org/2001/xml.xsd'                          
         => 'schema/xml.xsd',
];

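As shown, I have placed verbatim copies of these two XSD files in a subdirectory named schema. The next step is to use libxml_set_external_entity_loader to activate a callback function that uses the mapping. Files that already exist on disk are preferred and loaded directly. If the routine encounters a non-file that has no mapping, it throws a RuntimeException with a detailed message: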

libxml_set_external_entity_loader(
    function ($public, $system, $context) use ($mapping) {

        if (is_file($system)) {
            return $system;
        }

        if (isset($mapping[$system])) {
            return __DIR__ . '/' . $mapping[$system];
        }

        $message = sprintf(
            "Failed to load external entity: Public: %s; System: %s; Context: %s",
            var_export($public, 1), var_export($system, 1),
            strtr(var_export($context, 1), [" (\n  " => '(', "\n " => '', "\n" => ''])
        );

        throw new RuntimeException($message);
    }
);

Once this external entity loader is set, the delay caused by the remote requests is gone.
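
With the remote requests out of the way, the batch from the question can be processed in a plain loop. The following is a minimal sketch under stated assumptions: validateBatch() is a hypothetical helper name, and it only collects libxml's error messages per file. The schema is still parsed on every schemaValidate() call; only the network delay is avoided.

```php
<?php
/**
 * Validate a list of XML files against one schema file.
 * Returns file name => list of error messages (an empty list means valid).
 * validateBatch() is a hypothetical helper, not part of PHP or libxml.
 */
function validateBatch(array $xmlFiles, string $schemaFile): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $results = [];

    foreach ($xmlFiles as $xmlFile) {
        $dom = new DOMDocument();
        $ok = $dom->load($xmlFile) && $dom->schemaValidate($schemaFile);

        $messages = [];
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
        }
        libxml_clear_errors(); // do not leak errors into the next file

        if (!$ok && $messages === []) {
            $messages[] = 'validation failed without a libxml error message';
        }
        $results[$xmlFile] = $ok ? [] : $messages;
    }

    // Restore the previous error handling mode.
    libxml_use_internal_errors($previous);

    return $results;
}
```

A call such as validateBatch(glob('data/*.xml'), 'test-schema.xsd') would then return the error lists per file; the data directory is an assumption for illustration.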

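That's it. Please note: this external entity loader was written for loading XML files from disk for validation and for "resolving" XSD URIs to local file names. Other kinds of operations (for example DTD-based validation) may require some code changes or extensions. An XML catalog is the preferable solution; it also works with different tools.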
