Class HtmlBuilder


  • public class HtmlBuilder
    extends nu.xom.Builder
    This class implements an HTML5 parser that exposes data through the XOM interface.

    By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.

    The doctype is not represented in the tree.

    The document mode is represented via the Mode interface on the Document node if the node implements that interface (this depends on the node factory in use).

    The form pointer is stored if the node factory supports storing it.

    This package has its own node factory class because the official XOM node factory may return multiple nodes instead of one, which would break the assumptions of the DOM-oriented HTML5 parsing algorithm.

    Author:
    hsivonen
    • Constructor Detail

      • HtmlBuilder

        public HtmlBuilder()
        Constructor with default node factory and fatal XML violation policy.
      • HtmlBuilder

        public HtmlBuilder​(SimpleNodeFactory nodeFactory)
        Constructor with given node factory and fatal XML violation policy.
        Parameters:
        nodeFactory - the factory
      • HtmlBuilder

        public HtmlBuilder​(XmlViolationPolicy xmlPolicy)
        Constructor with default node factory and given XML violation policy.
        Parameters:
        xmlPolicy - the policy
      • HtmlBuilder

        public HtmlBuilder​(SimpleNodeFactory nodeFactory,
                           XmlViolationPolicy xmlPolicy)
        Constructor with given node factory and given XML violation policy.
        Parameters:
        nodeFactory - the factory
        xmlPolicy - the policy
    • Method Detail

      • build

        public nu.xom.Document build​(org.xml.sax.InputSource is)
                              throws nu.xom.ParsingException,
                                     java.io.IOException
        Parse from SAX InputSource.
        Parameters:
        is - the InputSource
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
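
As an illustration of preparing input for this method, the following minimal sketch wraps an in-memory string in a SAX InputSource using only JDK classes. The class name InputSources and the method fromString are hypothetical helpers, not part of this package, and treating the system ID as the document's base URI is an assumption based on common SAX practice.

```java
import java.io.StringReader;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of the parser library).
final class InputSources {
    private InputSources() {}

    /** Wraps in-memory markup in an InputSource backed by a character stream. */
    static InputSource fromString(String html, String systemId) {
        // A character stream means no encoding sniffing is needed for this input.
        InputSource source = new InputSource(new StringReader(html));
        // Record where the markup came from; SAX consumers conventionally
        // treat the system ID as the document URI (assumption for this parser).
        source.setSystemId(systemId);
        return source;
    }
}
```

The returned object could then be passed to build(InputSource), or to buildFragment(InputSource, String) together with a context element name such as "div".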
      • buildFragment

        public nu.xom.Nodes buildFragment​(org.xml.sax.InputSource is,
                                          java.lang.String context)
                                   throws java.io.IOException,
                                          nu.xom.ParsingException
        Parse a fragment from SAX InputSource.
        Parameters:
        is - the InputSource
        context - the name of the context element
        Returns:
        the fragment
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
      • build

        public nu.xom.Document build​(java.io.File file)
                              throws nu.xom.ParsingException,
                                     nu.xom.ValidityException,
                                     java.io.IOException
        Parse from File.
        Overrides:
        build in class nu.xom.Builder
        Parameters:
        file - the file
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
        nu.xom.ValidityException
        See Also:
        Builder.build(java.io.File)
      • build

        public nu.xom.Document build​(java.io.InputStream stream,
                                     java.lang.String uri)
                              throws nu.xom.ParsingException,
                                     nu.xom.ValidityException,
                                     java.io.IOException
        Parse from InputStream.
        Overrides:
        build in class nu.xom.Builder
        Parameters:
        stream - the stream
        uri - the base URI
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
        nu.xom.ValidityException
        See Also:
        Builder.build(java.io.InputStream, java.lang.String)
      • build

        public nu.xom.Document build​(java.io.InputStream stream)
                              throws nu.xom.ParsingException,
                                     nu.xom.ValidityException,
                                     java.io.IOException
        Parse from InputStream.
        Overrides:
        build in class nu.xom.Builder
        Parameters:
        stream - the stream
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
        nu.xom.ValidityException
        See Also:
        Builder.build(java.io.InputStream)
      • build

        public nu.xom.Document build​(java.io.Reader stream,
                                     java.lang.String uri)
                              throws nu.xom.ParsingException,
                                     nu.xom.ValidityException,
                                     java.io.IOException
        Parse from Reader.
        Overrides:
        build in class nu.xom.Builder
        Parameters:
        stream - the reader
        uri - the base URI
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
        nu.xom.ValidityException
        See Also:
        Builder.build(java.io.Reader, java.lang.String)
      • build

        public nu.xom.Document build​(java.io.Reader stream)
                              throws nu.xom.ParsingException,
                                     nu.xom.ValidityException,
                                     java.io.IOException
        Parse from Reader.
        Overrides:
        build in class nu.xom.Builder
        Parameters:
        stream - the reader
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
        nu.xom.ValidityException
        See Also:
        Builder.build(java.io.Reader)
      • build

        public nu.xom.Document build​(java.lang.String content,
                                     java.lang.String uri)
                              throws nu.xom.ParsingException,
                                     nu.xom.ValidityException,
                                     java.io.IOException
        Parse from String.
        Overrides:
        build in class nu.xom.Builder
        Parameters:
        content - the HTML source as string
        uri - the base URI
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
        nu.xom.ValidityException
        See Also:
        Builder.build(java.lang.String, java.lang.String)
      • build

        public nu.xom.Document build​(java.lang.String uri)
                              throws nu.xom.ParsingException,
                                     nu.xom.ValidityException,
                                     java.io.IOException
        Parse from URI.
        Overrides:
        build in class nu.xom.Builder
        Parameters:
        uri - the URI of the document
        Returns:
        the document
        Throws:
        nu.xom.ParsingException - in case of an XML violation
        java.io.IOException - if IO goes wrong
        nu.xom.ValidityException
        See Also:
        Builder.build(java.lang.String)
      • getSimpleNodeFactory

        public SimpleNodeFactory getSimpleNodeFactory()
        Gets the node factory.
      • setEntityResolver

        public void setEntityResolver​(org.xml.sax.EntityResolver resolver)
        See Also:
        XMLReader.setEntityResolver(org.xml.sax.EntityResolver)
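
The reference above does not describe when the resolver is consulted. As a rough sketch of what could be supplied here, the following EntityResolver serves markup from an in-memory map keyed by system ID, using only JDK classes. The class name InMemoryResolver is hypothetical. Returning null for unknown IDs follows the general SAX convention; whether this builder accepts a null result is an assumption that should be checked against the implementation.

```java
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver (not part of the parser library).
final class InMemoryResolver implements EntityResolver {
    private final Map<String, String> pages;

    InMemoryResolver(Map<String, String> pages) {
        this.pages = pages;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String html = pages.get(systemId);
        if (html == null) {
            // General SAX convention: null asks the caller to fall back to
            // its default lookup. This builder's handling is not documented above.
            return null;
        }
        InputSource source = new InputSource(new StringReader(html));
        source.setSystemId(systemId);
        return source;
    }
}
```

An instance would be registered with setEntityResolver(resolver) before calling one of the build methods.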
      • setErrorHandler

        public void setErrorHandler​(org.xml.sax.ErrorHandler handler)
        See Also:
        XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
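
To show the kind of handler that could be registered, here is a minimal sketch of an org.xml.sax.ErrorHandler that collects messages with their line and column instead of throwing. It uses only JDK classes. The class name CollectingErrorHandler and the message format are illustrative choices, not something the parser prescribes, and which parse problems the builder reports at which level is not specified above.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical handler (not part of the parser library).
final class CollectingErrorHandler implements ErrorHandler {
    private final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Recorded rather than rethrown; the caller decides what to do afterwards.
        record("fatal", e);
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + " " + e.getLineNumber() + ":" + e.getColumnNumber()
                + " " + e.getMessage());
    }

    /** Returns the messages recorded so far, in the order they were reported. */
    List<String> messages() {
        return messages;
    }
}
```

After setErrorHandler(handler) and a call to build, handler.messages() would hold whatever the parser reported for that input.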
      • setTransitionHander

        public void setTransitionHander​(TransitionHandler handler)
      • isCheckingNormalization

        public boolean isCheckingNormalization()
        Indicates whether NFC normalization of source is being checked.
        Returns:
        true if NFC normalization of source is being checked.
        See Also:
        nu.validator.htmlparser.impl.Tokenizer#isCheckingNormalization()
      • setCheckingNormalization

        public void setCheckingNormalization​(boolean enable)
        Toggles the checking of the NFC normalization of source.
        Parameters:
        enable - true to check normalization
        See Also:
        nu.validator.htmlparser.impl.Tokenizer#setCheckingNormalization(boolean)
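
For readers unfamiliar with NFC, the following sketch shows what "normalized" means using the JDK's java.text.Normalizer. It is independent of the parser: the class name NfcCheck is hypothetical, and how the parser itself performs or reports the check is not covered here.

```java
import java.text.Normalizer;

// Hypothetical helper illustrating Unicode Normalization Form C.
final class NfcCheck {
    private NfcCheck() {}

    /** Returns true if the text is already in Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    /** Returns the NFC form of the text (composed characters where possible). */
    static String toNfc(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

For example, "caf\u00e9" (precomposed e-acute) is in NFC, while "cafe\u0301" (e followed by a combining acute accent) is not, and toNfc converts the latter into the former.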
      • isScriptingEnabled

        public boolean isScriptingEnabled()
        Whether the parser considers scripting to be enabled for noscript treatment.
        Returns:
        true if enabled
        See Also:
        TreeBuilder.isScriptingEnabled()
      • setScriptingEnabled

        public void setScriptingEnabled​(boolean scriptingEnabled)
        Sets whether the parser considers scripting to be enabled for noscript treatment.
        Parameters:
        scriptingEnabled - true to enable
        See Also:
        TreeBuilder.setScriptingEnabled(boolean)
      • getDoctypeExpectation

        public DoctypeExpectation getDoctypeExpectation()
        Returns the doctype expectation.
        Returns:
        the doctypeExpectation
      • getDocumentModeHandler

        public DocumentModeHandler getDocumentModeHandler()
        Returns the document mode handler.
        Returns:
        the documentModeHandler
      • getStreamabilityViolationPolicy

        public XmlViolationPolicy getStreamabilityViolationPolicy()
        Returns the streamabilityViolationPolicy.
        Returns:
        the streamabilityViolationPolicy
      • setStreamabilityViolationPolicy

        public void setStreamabilityViolationPolicy​(XmlViolationPolicy streamabilityViolationPolicy)
        Sets the streamabilityViolationPolicy.
        Parameters:
        streamabilityViolationPolicy - the streamabilityViolationPolicy to set
      • setHtml4ModeCompatibleWithXhtml1Schemata

        public void setHtml4ModeCompatibleWithXhtml1Schemata​(boolean html4ModeCompatibleWithXhtml1Schemata)
        Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
        Parameters:
        html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value
      • getDocumentLocator

        public org.xml.sax.Locator getDocumentLocator()
        Returns the Locator during parse.
        Returns:
        the Locator
      • isHtml4ModeCompatibleWithXhtml1Schemata

        public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
        Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
        Returns:
        the html4ModeCompatibleWithXhtml1Schemata
      • setMappingLangToXmlLang

        public void setMappingLangToXmlLang​(boolean mappingLangToXmlLang)
        Sets whether lang is mapped to xml:lang.
        Parameters:
        mappingLangToXmlLang - true to map lang to xml:lang
        See Also:
        Tokenizer.setMappingLangToXmlLang(boolean)
      • isMappingLangToXmlLang

        public boolean isMappingLangToXmlLang()
        Whether lang is mapped to xml:lang.
        Returns:
        the mappingLangToXmlLang
      • getXmlnsPolicy

        public XmlViolationPolicy getXmlnsPolicy()
        Returns the xmlnsPolicy.
        Returns:
        the xmlnsPolicy
      • getCommentPolicy

        public XmlViolationPolicy getCommentPolicy()
        Returns the commentPolicy.
        Returns:
        the commentPolicy
      • getContentNonXmlCharPolicy

        public XmlViolationPolicy getContentNonXmlCharPolicy()
        Returns the contentNonXmlCharPolicy.
        Returns:
        the contentNonXmlCharPolicy
      • getContentSpacePolicy

        public XmlViolationPolicy getContentSpacePolicy()
        Returns the contentSpacePolicy.
        Returns:
        the contentSpacePolicy
      • isReportingDoctype

        public boolean isReportingDoctype()
        Returns the reportingDoctype.
        Returns:
        the reportingDoctype
      • setHeuristics

        public void setHeuristics​(Heuristics heuristics)
        Sets the encoding sniffing heuristics.
        Parameters:
        heuristics - the heuristics to set
        See Also:
        nu.validator.htmlparser.impl.Tokenizer#setHeuristics(nu.validator.htmlparser.common.Heuristics)
      • getHeuristics

        public Heuristics getHeuristics()
      • setXmlPolicy

        public void setXmlPolicy​(XmlViolationPolicy xmlPolicy)
        This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.
        Parameters:
        xmlPolicy - the policy to apply
      • getNamePolicy

        public XmlViolationPolicy getNamePolicy()
        The policy for non-NCName element and attribute names.
        Returns:
        the namePolicy
      • setBogusXmlnsPolicy

        public void setBogusXmlnsPolicy​(XmlViolationPolicy bogusXmlnsPolicy)
        Deprecated.
        Does nothing.
      • getBogusXmlnsPolicy

        public XmlViolationPolicy getBogusXmlnsPolicy()
        Deprecated.
        Returns XmlViolationPolicy.ALTER_INFOSET.
        Returns:
        XmlViolationPolicy.ALTER_INFOSET
      • addCharacterHandler

        public void addCharacterHandler​(CharacterHandler characterHandler)
      • setIgnoringComments

        public void setIgnoringComments​(boolean ignoreComments)
        Sets whether comment nodes appear in the tree.
        Parameters:
        ignoreComments - true to ignore comments
        See Also:
        TreeBuilder.setIgnoringComments(boolean)