Mixp (Guile interface to Expat) manual

For use with Mixp 0.20010812

Thierry Bézecourt


1 Introduction

Mixp is a Scheme interface to James Clark's expat library. It may be used to parse XML documents with Guile.

If you do not know expat, first have a look at the sample programs (see section 1.1 Sample programs). Typically, you create a parser object with expat:parser-create, then associate one or more handlers with it (usually with expat:set-element-handler and expat:set-character-data-handler), then parse the document with expat:parse or mixp:parse-file. The most commonly used functions are documented here; you may guess how the others work from their prototypes. See also the test programs in the test/ subdirectory of the distribution.

If you already know expat, you can easily find what you are looking for: take a C expat function name, replace XML_ with expat:, use hyphens instead of capital letters to separate the words, and look the result up in the reference documentation (see section 2 Expat interface). In most cases, the prototype is the same, modulo the differences between C and Scheme.
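The renaming rule can be sketched in plain Scheme. The procedure name c-name->mixp-name is hypothetical and is not part of Mixp; the sketch assumes its argument starts with XML_ and treats a run of capitals (such as NS) as a single word.

```scheme
;; Hypothetical helper, not part of Mixp: apply the renaming rule above.
;; Assumes C-NAME starts with the four characters "XML_".
(define (c-name->mixp-name c-name)
  (let loop ((chars (string->list (substring c-name 4)))
             (prev-upper? #t)          ; true at the start: no leading hyphen
             (acc '()))
    (cond ((null? chars)
           (string-append "expat:" (list->string (reverse acc))))
          ((char-upper-case? (car chars))
           ;; A capital starts a new word unless it follows another capital.
           (loop (cdr chars) #t
                 (cons (char-downcase (car chars))
                       (if prev-upper? acc (cons #\- acc)))))
          (else
           (loop (cdr chars) #f (cons (car chars) acc))))))

(display (c-name->mixp-name "XML_SetElementHandler"))
(newline)
```

With this input the sketch prints expat:set-element-handler. Names that Mixp does not implement (see section 2.5 Not implemented) are still translated, so the result is only a search key for the reference documentation.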

Another source of information, maybe more accurate, is expat itself: expat.html, xmlparse.h, and http://www.jclark.com/xml/expatfaq.html.

1.1 Sample programs

The following sample program reads an XML file (provided with the Mixp distribution) and prints its start and end tags. You can launch a Guile shell from the samples/ directory of the distribution and execute this code. Your GUILE_LOAD_PATH variable should contain the directory in which you installed Mixp (that is, the directory which contains the xml/ subdirectory).

(use-modules (xml expat)
             (xml mixp))

;; Create the parser object
(let ((parser (expat:parser-create)))
  ;; Specify callback functions
  (expat:set-element-handler parser
			     (lambda (p name attribs)
			       (display "start ")(display name)(newline))
			     (lambda (p name)
			       (display "end ")(display name)(newline)))
  ;; Parse the file
  (mixp:parse-file parser "REC-xml-19980210.xml"))

For more information about the Expat interface and handlers, see section 2 Expat interface.

The following sample program builds a hierarchical tree structure from an XML document that resides in a string. This tree structure should be easy to use with traditional Scheme functions.

(use-modules (xml mixp))

(let ((xml-doc "<foo name='Paul'><bar>Some text</bar><void/></foo>"))
  (display (call-with-input-string xml-doc mixp:xml->tree)))

The result is:

((element ("foo" (("name" . "Paul")))
	  (element ("bar" ())
		   (character-data "Some text"))
	  (element ("void" ()))))

For more information about this interface, see section 3 High-level extensions.

1.2 Loading Mixp

From the Guile shell or from a Guile script, you should type the following commands before using the Mixp API:

(use-modules (xml expat))
(use-modules (xml mixp))

Actually, you may load just xml:expat if you intend to use only the raw expat interface (i.e. the functions whose names are prefixed by expat:, see section 2 Expat interface). You need xml:mixp if you want to use the extension functions (see section 3 High-level extensions).

1.3 Mixp components

Mixp contains two Scheme modules:

  • xml:expat, the raw expat interface (see section 2 Expat interface);
  • xml:mixp, the extension functions (see section 3 High-level extensions).

From another point of view, Mixp contains two files: a shared library, libexpat.so, which defines the xml:expat interface and a part of the xml:mixp interface, and a Scheme file, mixp.scm, which defines the other parts of the xml:mixp interface. Both files are located in the xml directory somewhere along your GUILE_LOAD_PATH.

1.4 How to...

This section describes a few common tasks which may be solved with Mixp.

2 Expat interface

This section contains the reference documentation for the expat interface, i.e. the xml:expat module.

2.1 Expat handlers

Handlers are functions called by the parser at specific points in the XML document (for example when finding a new tag, or a comment, etc). This section explains how to specify these handlers, and what their prototype should be.

Most of the time, the first argument will be user-data. user-data is either a user-specified buffer (see expat:set-user-data) or the parser object itself (see expat:use-parser-as-handler-arg).

Function: expat:set-element-handler parser start-handler end-handler
Specifies the callback functions to be called when an element starts or ends.

parser is the parser object returned by expat:parser-create.

start-handler is a function to be called when an opening tag is found. This function will be called as follows:

(start-handler user-data name attributes)

name is the tag name. attributes is an association list which contains the attributes with their values.

end-handler is a function to be called when a closing tag is found (</foo>). This function will be called as follows:

(end-handler user-data name)

The arguments have the same meaning as in the start handler.
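Because attributes is an ordinary association list keyed by strings, assoc is enough to read a value. The sketch below calls a start handler by hand with the kind of list shown in section 1.1; the names attribute-value and start-handler are illustrative and not part of Mixp.

```scheme
;; Illustrative helper: value of attribute KEY, or DEFAULT if it is absent.
(define (attribute-value attributes key default)
  (let ((entry (assoc key attributes)))
    (if entry (cdr entry) default)))

;; A start handler with the prototype described above.
(define (start-handler user-data name attributes)
  (display "start ")
  (display name)
  (display " name=")
  (display (attribute-value attributes "name" "(none)"))
  (newline))

;; Called by hand here; the parser would normally make this call.
(start-handler #f "foo" '(("name" . "Paul")))
```

This prints start foo name=Paul. A handler defined this way can be passed to expat:set-element-handler as the start-handler argument.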

Function: expat:set-character-data-handler parser handler
Sets the character data handler, which is called for normal text (i.e. text outside <> tags). It will be called as follows:

(handler user-data value)

value is the text encoded in UTF-8.
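Expat does not promise to deliver a run of text in a single call, so a handler that needs the complete text usually accumulates the pieces. The following is a minimal sketch in plain Scheme; make-text-collector is a hypothetical name, and the handler is called by hand to stand in for the parser.

```scheme
;; Returns a pair: a character data handler, and a thunk that returns
;; everything the handler has received so far, joined in arrival order.
(define (make-text-collector)
  (let ((chunks '()))
    (cons (lambda (user-data value)
            (set! chunks (cons value chunks)))
          (lambda ()
            (apply string-append (reverse chunks))))))

(define collector (make-text-collector))
(define text-handler (car collector))
(define collected-text (cdr collector))

;; Simulate the parser delivering one run of text in two calls.
(text-handler #f "Some ")
(text-handler #f "text")
(display (collected-text))
(newline)
```

In a real program, text-handler would be registered with expat:set-character-data-handler, and collected-text called from the end-element handler.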

Function: expat:set-processing-instruction-handler parser handler

Sets the processing instruction handler, which should have the following prototype:

(handler user-data pi-data)

This handler will be called by Mixp every time it finds a processing instruction (<? ... ?>).

Function: expat:set-comment-handler parser handler

Sets the comment handler, which should have the following prototype:

(handler user-data comment-data)

This handler will be called by Mixp every time it finds a comment (<!-- ... -->).

Function: expat:set-cdata-section-handler parser start-handler end-handler

Sets the CDATA section handlers, which should have the following prototypes:

(start-handler user-data)
(end-handler user-data)

Mixp calls start-handler at the beginning and end-handler at the end of every CDATA section it finds (<![CDATA[ ... ]]>).

Function: expat:set-default-handler parser handler

Sets the default handler and also inhibits expansion of internal entities. The entity reference will be passed to the default handler.

(handler user-data string)

The default handler is called for any characters in the XML document for which there is no applicable handler. This includes both characters that are part of markup of a kind that is not reported (comments, markup declarations) and characters that are part of a construct which could be reported but for which no handler has been supplied. The characters are passed exactly as they were in the XML document, except that they will be encoded in UTF-8. Line boundaries are not normalized. Note that a byte order mark character is not passed to the default handler. There are no guarantees about how characters are divided between calls to the default handler: for example, a comment might be split between multiple calls.

Function: expat:set-default-handler-expand parser handler

Sets the default handler but does not inhibit expansion of internal entities. The entity reference will not be passed to the default handler.

(handler user-data string)

See expat:set-default-handler for a description of the handler.

Function: expat:set-unparsed-entity-decl-handler parser handler

Sets the unparsed entity declaration handler, which should have the following prototype:

(handler user-data entity-name base system-id
public-id notation-name)

The handler is called by Mixp every time it finds a declaration of an unparsed entity (`<!ENTITY Antarctica SYSTEM "http://www.antarctica.net" NDATA vrml>').

The base argument is whatever was set by expat:set-base. The entity-name, system-id and notation-name arguments will never be #f. The other arguments may be.

Function: expat:set-notation-decl-handler parser handler
Called when finding a notation decl (for example `<!NOTATION vrml PUBLIC "VRML 2">'). This function will be called as follows:

(handler user-data notation-name base system-id
public-id)

The base argument is whatever was set by expat:set-base. notation-name will never be #f. The other arguments can be.

Function: expat:set-namespace-decl-handler parser start-handler end-handler

Sets the namespace declaration handlers, which should have the following prototypes:

(start-namespace-decl-handler user-data prefix uri)
(end-namespace-decl-handler user-data prefix uri)

When namespace processing is enabled, these are called once for each namespace declaration. The calls to the start and end element handlers occur between the calls to the start and end namespace declaration handlers. For an xmlns attribute, prefix will be null. For an xmlns="" attribute, uri will be null.

Function: expat:set-not-standalone-handler parser handler

Sets the not-standalone handler, which should have the following prototype:

(not-standalone-handler user-data)

This is called if the document is not standalone (it has an external subset or a reference to a parameter entity, but does not have standalone="yes"). If this handler returns 0, then processing will not continue, and the parser will return an expat:XML_ERROR_NOT_STANDALONE error.

Function: expat:set-external-entity-ref-handler parser handler

(external-entity-ref-handler user-data context base
system-id public-id)

The external entity reference handler is called by Mixp when it finds a reference to an external entity in the document. For example, the <!DOCTYPE ...> declaration contains an external entity reference when it specifies an external DTD. In that case, you should also call (expat:set-param-entity-parsing parser), because you probably want Mixp to expand the references to entities declared in your DTD. See section 1.4 How to... for an example.

The external entity reference handler should return an open port to the external entity. For example, assuming that system-id refers to a relative file path, you may define the handler as follows:


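(lambda (my-parser context base system-id public-id)
  ;; Report which entity is being opened (#t sends the output to
  ;; the current output port).
  (format #t "Reference to external entity: ~A.\n" system-id)
  ;; Return an open input port on the entity.
  (open-file system-id "r"))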

The system identifier is defined by the XML specification as a URI. Therefore, the example above will only work if you know that the system id is actually a file path. You may need to use, for example, an HTTP library if you want to support URIs which start with "http://".

Note that the behaviour of this handler is very different in expat.

Also see expat:set-external-entity-ref-handler-arg.

Function: expat:set-unknown-encoding-handler parser handler
Unknown encoding handlers have not really been tested, so they probably do not work for now.

(unknown-encoding-handler encoding-handler-data name
info)

2.2 Other expat functions

Function: expat:parser-create encoding
Constructs a new parser; encoding is the encoding specified by the external protocol or #t if there is none specified.

Function: expat:parser-create-ns encoding namespace-separator

Function: expat:set-external-entity-ref-handler-arg arg
If a non-nil value for arg is specified here, then it will be passed as the first argument to the external entity ref handler instead of the parser object.

Function: expat:default-current

Function: expat:set-user-data parser data
Associates a value to the parser. This value will be passed as the user-data argument to callbacks, unless expat:use-parser-as-handler-arg has been called. This value can be any Scheme value.

Function: expat:get-user-data parser
Returns the user data associated with the parser object.
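As a hedged sketch of how user data might be used, the following counts start tags in a one-slot vector attached to the parser. It uses only procedures documented in this manual, but it has not been checked against a Mixp installation; the XML string is an arbitrary example.

```scheme
(use-modules (xml expat)
             (xml mixp))

(let ((parser (expat:parser-create))
      (counter (make-vector 1 0)))          ; one-slot mutable counter
  ;; Attach the counter; handlers receive it as their first argument.
  (expat:set-user-data parser counter)
  (expat:set-element-handler parser
                             (lambda (user-data name attribs)
                               (vector-set! user-data 0
                                            (+ 1 (vector-ref user-data 0))))
                             (lambda (user-data name) #f))
  ;; Parse a document held in a string, using the parser set up above.
  (call-with-input-string "<foo><bar/><bar/></foo>"
    (lambda (port) (mixp:parse-data port parser)))
  ;; Read the counter back through the parser object.
  (display (vector-ref (expat:get-user-data parser) 0))
  (newline))
```

Any Scheme value can serve as user data; a vector is used here only because it can be updated in place.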

Function: expat:set-encoding parser encoding
Encoding functions are not really tested for now.

Function: expat:use-parser-as-handler-arg
Specifies that the user-data argument passed to callbacks should be the parser object itself. User data may still be associated to the parser with expat:set-user-data and retrieved with expat:get-user-data.

Function: expat:set-base parser base

Function: expat:get-base parser

Function: expat:get-specified-attribute-count parser

Function: expat:parse parser buffer is-final
Parse some input. Returns 0 if a fatal error is detected. In that case, you should call expat:get-error-code to obtain more information. The last call to expat:parse must have is-final set to #t.

Function: expat:parse-buffer len is-final

Function: expat:get-error-code parser
Returns the error code associated with the parser. This function should be called after expat:parse has returned 0 (i.e. an error). The result is a symbol which identifies the error, for example expat:XML_ERROR_TAG_MISMATCH or expat:XML_ERROR_NOT_STANDALONE.

An error message describing the error may be obtained by calling expat:error-string.

Function: expat:get-current-byte-count parser

Function: expat:get-current-line-number parser

Function: expat:get-current-column-number parser

Function: expat:get-current-byte-index parser

Function: expat:error-string code
Returns an error message describing the expat error code defined by code. code is usually obtained by calling expat:get-error-code on the parser object.

2.3 Encodings

Expat supports the following encodings: UTF-8, UTF-16, ISO-8859-1, US-ASCII.

The encoding is usually indicated in the first line of an XML file (the <?xml... ?> declaration). However, all the data you receive in your handlers (tag names, attributes, character data...) is encoded in UTF-8, whatever the original encoding was. UTF-8 represents ASCII characters unchanged, but represents every other character as a multi-byte sequence. As a result, text with non-ASCII characters looks very strange on most terminals when it is encoded in UTF-8. ISO-8859-1 is better supported by standard editors, but is too Eurocentric.
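The difference in size between ASCII and non-ASCII characters in UTF-8 can be observed directly. This sketch does not use Mixp, and it assumes a Guile recent enough to provide the (rnrs bytevectors) module (Guile 2.0 or later), which is newer than the Guile this manual was written for; utf8-length is an illustrative name.

```scheme
(use-modules (rnrs bytevectors))

;; Number of bytes needed to encode STR in UTF-8.
(define (utf8-length str)
  (bytevector-length (string->utf8 str)))

;; Three ASCII characters take three bytes.
(display (utf8-length "abc"))
(newline)

;; A single e-acute (code point 233) takes two bytes.
(display (utf8-length (string (integer->char 233))))
(newline)
```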

The encoding features of expat are not completely supported in Mixp. Using unknown encoding handlers will not work, or at least I have not tested that feature. However, XML documents whose encoding (as specified in the <?xml... ?> declaration) is supported by expat should be parsed correctly. For example, you should get an error if you parse a document which claims to be US-ASCII but contains 8-bit characters.

2.4 Error handling

In the Expat interface, expat:parse returns 0 when an error is encountered, i.e. when the document is not well-formed. Then expat:get-error-code should be called to retrieve an error code, as a Scheme symbol, which identifies the error. The error codes are listed in the documentation of expat:get-error-code (see section 2.2 Other expat functions).
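The sequence just described might look like the following sketch for the raw interface. It relies only on the prototypes given in section 2.2 and has not been checked against a Mixp installation; in particular, passing the document to expat:parse as a Scheme string is an assumption.

```scheme
(use-modules (xml expat))

(let ((parser (expat:parser-create)))
  ;; A single call with is-final set to #t parses the whole document.
  (if (eqv? 0 (expat:parse parser "<doc>dfssfd</do>" #t))
      ;; 0 signals a fatal error: ask the parser what went wrong.
      (let ((code (expat:get-error-code parser)))
        (display "Parse error at line ")
        (display (expat:get-current-line-number parser))
        (display ": ")
        (display (expat:error-string code))
        (newline))
      (begin
        (display "Document is well-formed")
        (newline))))
```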

The functions in the Mixp extensions use the same error codes, but they throw them as exceptions instead of returning 0. The following code demonstrates simple error handling with mixp:parse-data:

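(let ((bad-xml "<doc>dfssfd</do>"))
  (catch #t
    (lambda ()
      (call-with-input-string bad-xml mixp:parse-data))
    ;; The key is the error code; the rest list absorbs any extra
    ;; arguments passed along with the throw.
    (lambda (key . args)
      (display "Received an error: ")(display key)(newline))))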

Output: Received an error: expat:XML_ERROR_TAG_MISMATCH

2.5 Not implemented

The following function is a part of the expat interface, but it was not implemented.

Function: XML_GetBuffer

You should also read the section about encodings (see section 2.3 Encodings).

3 High-level extensions

The following functions are extensions to the raw expat interface, but I still don't know exactly what to do here.

Function: mixp:parse-file parser file
Parses an XML file (see section 2.4 Error handling).

Function: mixp:parse-data port [parser]
Parses XML data coming from port (see section 2.4 Error handling). If parser is specified, it must be a parser object, and it will be used to parse the data. Otherwise, a new parser object is created with expat:parser-create, and this procedure just checks that the document is well-formed. port must be an open input port (see the Ports section of the Guile reference manual). Parsing continues until the end of the port data is reached.

Function: mixp:call-with-input-string string parser
Parses a string containing an XML document (see section 2.4 Error handling). parser should be created with another procedure such as expat:parser-create.

Function: mixp:call-with-input-file file parser
Parses an XML file (see section 2.4 Error handling). parser should be created with another procedure such as expat:parser-create.

Function: mixp:xml->list port [parser]
Parses an XML document from port and returns a list of XML nodes. Uses parser if specified, else creates a new parser with expat:parser-create. Each XML node is a small list which describes a part of the XML file. The first item in that list is a symbol whose value is the node type. The meaning of the other items depends upon the node type. The following node types are supported (other kinds of data in the XML file are ignored):

start-element
A start element. The second item in the node is the tag name, the third is an alist which represents the attributes.
end-element
An end element. The second item in the node is the tag name.
character-data
A character data element. The second item in the node is the contents of the item.
notation-decl
A notation declaration. The second, third, fourth and fifth items in the node are the notation name, the base, the system id and the public id.
entity-decl
An entity declaration. Only unparsed entity declarations are supported here. The second, third, fourth, fifth and sixth items in the node are the entity name, the base, the system id, the public id and the notation name.
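Since the result is a plain list of tagged lists, it can be processed with ordinary list operations. The sketch below collects the tag names of all start-element nodes; it runs on a hand-written list in the shape described above rather than on real parser output, and element-names is an illustrative name, not part of Mixp.

```scheme
;; A node list in the format described above, written by hand.
(define sample-nodes
  '((start-element "foo" (("name" . "Paul")))
    (start-element "bar" ())
    (character-data "Some text")
    (end-element "bar")
    (start-element "void" ())
    (end-element "void")
    (end-element "foo")))

;; Tag names of all start-element nodes, in document order.
(define (element-names nodes)
  (let loop ((nodes nodes) (names '()))
    (cond ((null? nodes) (reverse names))
          ((eq? (car (car nodes)) 'start-element)
           (loop (cdr nodes) (cons (cadr (car nodes)) names)))
          (else (loop (cdr nodes) names)))))

(write (element-names sample-nodes))
(newline)
```

This prints ("foo" "bar" "void"). The same procedure should apply to the value returned by mixp:xml->list.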

Function: mixp:xml->tree port [parser]
Returns a tree structure of XML nodes. See mixp:xml->list above for the arguments, and for the supported node types. To give an idea of the tree structure which is supported, let us consider the following sample XML document.

<foo name='Paul'><bar>Some text</bar><void/></foo>

For this document, mixp:xml->list will return the following list:

((start-element "foo" (("name" . "Paul")))
 (start-element "bar" ())
 (character-data "Some text")
 (end-element "bar")
 (start-element "void" ())
 (end-element "void")
 (end-element "foo"))

And this is the data structure produced by mixp:xml->tree:

(element ("foo" (("name" . "Paul")))
	 (element ("bar" ())
		  (character-data "Some text"))
	 (element ("void" ())))
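A tree in this shape lends itself to recursive processing. As a sketch, the following procedure concatenates all the character data found below a node. It is applied to a hand-written copy of the tree above; tree-text is an illustrative name, and the sketch assumes that it is given a single node (an element or a character-data node).

```scheme
;; The tree shown above, written by hand.
(define sample-tree
  '(element ("foo" (("name" . "Paul")))
            (element ("bar" ())
                     (character-data "Some text"))
            (element ("void" ()))))

;; Concatenate the character data found in NODE and its descendants.
(define (tree-text node)
  (case (car node)
    ((character-data) (cadr node))
    ;; The children of an element follow its (name attributes) header.
    ((element) (apply string-append (map tree-text (cddr node))))
    (else "")))

(display (tree-text sample-tree))
(newline)
```

This prints Some text.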

Hint: use call-with-input-file or call-with-input-string in conjunction with mixp:xml->list or mixp:xml->tree to create structured views of XML documents:

(call-with-input-file "foobar.xml" mixp:xml->tree)

Function: mixp:tree->list TREE
Transforms a tree, as returned by mixp:xml->tree, into a list, as returned by mixp:xml->list.

Function: mixp:list->tree LIST
Transforms a list, as returned by mixp:xml->list, into a tree, as returned by mixp:xml->tree.
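To make the relationship between the two representations concrete, here is a sketch of the tree-to-list direction for element and character-data nodes. It is not Mixp's implementation; my-tree->list is a hypothetical name chosen to avoid any confusion with mixp:tree->list.

```scheme
;; Hypothetical re-implementation, for illustration only.
(define (my-tree->list node)
  (case (car node)
    ((element)
     (let ((name (car (cadr node)))
           (attributes (cadr (cadr node))))
       ;; An element becomes a start node, its flattened children,
       ;; and an end node.
       (append (list (list 'start-element name attributes))
               (apply append (map my-tree->list (cddr node)))
               (list (list 'end-element name)))))
    ;; Any other node is kept as it is.
    (else (list node))))

(write (my-tree->list
        '(element ("foo" (("name" . "Paul")))
                  (element ("bar" ())
                           (character-data "Some text"))
                  (element ("void" ())))))
(newline)
```

Applied to the sample tree of mixp:xml->tree, the sketch produces the node list shown for mixp:xml->list.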

Function: mixp:expat-error-code integer
Returns the Scheme symbol associated with the expat error code integer.

Function: mixp:parser? arg
Returns #t if arg is a parser object (as created by expat:parser-create).

Concept index

e

  • encodings

h

  • handlers

l

  • loading Mixp

m

  • mixp components

s

  • sample programs

x

  • xml: API
  • xml:expat API (handlers)
  • xml:expat API (miscellaneous functions)

Function index

e

  • expat:default-current
  • expat:error-string
  • expat:get-base
  • expat:get-current-byte-count
  • expat:get-current-byte-index
  • expat:get-current-column-number
  • expat:get-current-line-number
  • expat:get-error-code
  • expat:get-specified-attribute-count
  • expat:get-user-data
  • expat:parse
  • expat:parse-buffer
  • expat:parser-create
  • expat:parser-create-ns
  • expat:set-base
  • expat:set-cdata-section-handler
  • expat:set-character-data-handler
  • expat:set-comment-handler
  • expat:set-default-handler
  • expat:set-default-handler-expand
  • expat:set-element-handler
  • expat:set-encoding
  • expat:set-external-entity-ref-handler
  • expat:set-external-entity-ref-handler-arg
  • expat:set-namespace-decl-handler
  • expat:set-not-standalone-handler
  • expat:set-notation-decl-handler
  • expat:set-processing-instruction-handler
  • expat:set-unknown-encoding-handler
  • expat:set-unparsed-entity-decl-handler
  • expat:set-user-data
  • expat:use-parser-as-handler-arg

m

  • mixp:call-with-input-file
  • mixp:call-with-input-string
  • mixp:expat-error-code
  • mixp:list->tree
  • mixp:parse-data
  • mixp:parse-file
  • mixp:parser?
  • mixp:tree->list
  • mixp:xml->list
  • mixp:xml->tree

x

  • XML_GetBuffer

This document was generated on 12 August 2001 using the texi2html translator version 1.51.