-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3)                   Library Functions Manual                   PCRE2(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C,  that  implement  regular  expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. Some features that appeared in Python and the origi-
       nal  PCRE  before  they  appeared  in Perl are also available using the
       Python syntax. There is also some support for one or two .NET and Onig-
       uruma  syntax  items,  and  there are options for requesting some minor
       changes that give better ECMAScript (aka JavaScript) compatibility.

       The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
       32-bit  code units, which means that up to three separate libraries may
       be installed.  The original work to extend PCRE to  16-bit  and  32-bit
       code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
       tively. In all three cases, strings can be interpreted  either  as  one
       character  per  code  unit, or as UTF-encoded Unicode, with support for
       Unicode general category properties. Unicode  support  is  optional  at
       build  time  (but  is  the default). However, processing strings as UTF
       code units must be enabled explicitly at run time. The version of  Uni-
       code in use can be discovered by running

         pcre2test -C

       The  three  libraries  contain  identical sets of functions, with names
       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be  written  using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.   For  a discussion of the two matching algorithms, see the
       pcre2matching page.

       Details of exactly which Perl regular expression features are  and  are
       not  supported  by  PCRE2  are  given  in  separate  documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
       pcre2syntax page.

       Some  features  of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcre2build page. Documentation about build-
       ing  PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal functions and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits  users  to
       supply  arbitrary  patterns  for  compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of  UTF-8  code
       units instead of individual 8-bit characters. This causes both the pat-
       tern and any data against which it is matched to be checked  for  UTF-8
       validity.  If the data string is very long, such a check might consume
       enough resources to degrade your application's performance.

       One  way  of guarding against this possibility is to use the pcre2_pat-
       tern_info() function to check the compiled pattern's options  for  UTF.
       Alternatively,  you can set the PCRE2_NEVER_UTF option at compile time.
       This causes a compile time error if a pattern contains a UTF-setting
       sequence.

       If  your  application  is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be  matched  many
       times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       Another way that performance can be hit is by running  a  pattern  that
       has  a  very  large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2  pro-
       vides  some  protection  against  this: see the pcre2_set_match_limit()
       function in the pcre2api page.


USER DOCUMENTATION

       The user documentation for PCRE2 comprises a number of  different  sec-
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index  page.
       In  the  plain  text  format,  the  descriptions  of  the pcre2grep and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively.  The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual  func-
       tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2stack         discussion of stack usage
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page  for  each  C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

       Putting  an  actual email address here is a spam magnet. If you want to
       email me, use my two initials, followed by the two digits  10,  at  the
       domain cam.ac.uk.


REVISION

       Last updated: 18 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3)                Library Functions Manual                PCRE2API(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2  is  a  new API for PCRE. This document contains a description of
       all its functions. See the pcre2 document for an overview  of  all  the
       PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_config(uint32_t what, void *where);


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There  are  three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
       code units, respectively. However,  there  is  just  one  header  file,
       pcre2.h.   This  contains the function prototypes and other definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries.

       Character  strings are passed to and from a PCRE2 library as a sequence
       of unsigned integers in code units  of  the  appropriate  width.  Every
       PCRE2  function  comes  in three different forms, one for each library,
       for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The UCHAR types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
       types are pointers to constant data of the equivalent UCHAR types, that
       is, they are pointers to read-only vectors of unsigned code units.

       Many  applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile()  and  PCRE2_SPTR.  These  macros  use  the  value  of  the macro
       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific  func-
       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An application must define it to be  8,  16,  or  32  before  including
       pcre2.h in order to make use of the generic names.

       Applications  that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to
       be  0  before  including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where  the  value  of
       PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If  PCRE2_CODE_UNIT_WIDTH  is  not  defined before including pcre2.h, a
       compiler error occurs.

       When using multiple libraries in an application,  you  must  take  care
       when  processing  any  particular  pattern to use only functions from a
       single library.  For example, if you want to run a match using  a  pat-
       tern  that  was  compiled  with pcre2_compile_16(), you must do so with
       pcre2_match_16(), not pcre2_match_8().

       In the function summaries above, and in the rest of this  document  and
       other  PCRE2  documents,  functions  and data types are described using
       their generic names, without the 8, 16, or 32 suffix.


PCRE2 API OVERVIEW

       PCRE2 has its own native API, which  is  described  in  this  document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond to the POSIX regular expression API, but they do not give  access
       to all the functionality. They are described in the pcre2posix documen-
       tation. Both these APIs define a set of C function calls.

       The native API C data types, function prototypes,  option  values,  and
       error codes are defined in the header file pcre2.h, which contains def-
       initions of PCRE2_MAJOR and PCRE2_MINOR, the major  and  minor  release
       numbers  for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program  against  a non-dll PCRE2 library, you must define PCRE2_STATIC
       before including pcre2.h.

       The functions pcre2_compile() and pcre2_match() are used for compiling
       and  matching regular expressions in a Perl-compatible manner. A sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of this program is  given  in  the  pcre2demo  documentation,  and  the
       pcre2sample documentation describes how to compile and run it.

       Just-in-time  compiler support is an optional feature of PCRE2 that can
       be built in appropriate hardware environments. It greatly speeds up the
       matching  performance of many patterns. Programs can request that it be
       used if available, by calling pcre2_jit_compile() after a  pattern  has
       been successfully compiled by pcre2_compile(). This does nothing if JIT
       support is not available.

       More complicated programs might need to  make  use  of  the  specialist
       functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
       pcre2_jit_stack_assign() in order to  control  the  JIT  code's  memory
       usage.

       JIT matching is automatically used by pcre2_match() if it is available.
       There is also a direct interface for JIT matching, which gives improved
       performance.  The  JIT-specific functions are discussed in the pcre2jit
       documentation.

       A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
       patible,  is  also  provided.  This  uses a different algorithm for the
       matching. The alternative algorithm finds all possible  matches  (at  a
       given  point  in  the subject), and scans the subject just once (unless
       there are lookbehind assertions).  However,  this  algorithm  does  not
       return  captured  substrings.  A  description of the two matching algo-
       rithms  and  their  advantages  and  disadvantages  is  given  in   the
       pcre2matching    documentation.   There   is   no   JIT   support   for
       pcre2_dfa_match().

       In addition to the main compiling and  matching  functions,  there  are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
       vided, to free the memory used for extracted strings.

       The  function  pcre2_substitute()  can be called to match a pattern and
       return a copy of the subject string with substitutions for  parts  that
       were matched.

       Finally,  there  are functions for finding out information about a com-
       piled pattern (pcre2_pattern_info()) and about the  configuration  with
       which PCRE2 was built (pcre2_config()).


STRING LENGTHS AND OFFSETS

       The  PCRE2  API  uses  string  lengths and offsets into strings of code
       units in several places. These values are always  of  type  PCRE2_SIZE,
       which  is an unsigned integer type, currently always defined as size_t.
       The largest  value  that  can  be  stored  in  such  a  type  (that  is
       ~(PCRE2_SIZE)0)  is reserved as a special indicator for zero-terminated
       strings and unset offsets.  Therefore, the longest string that  can  be
       handled is one less than this maximum.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a  single  LF  (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline  sequences
       are  the  three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each  of  the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified. If none is given, it is LF, the Unix standard.
       However, the newline convention can be changed by an application
       when calling pcre2_compile(), or it can be specified by special text at
       the start of the pattern itself; this overrides any other settings. See
       the pcre2pattern page for details of the special character sequences.

       In  the  PCRE2  documentation  the  word "newline" is used to mean "the
       character or pair of characters that indicate a line break". The choice
       of  newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF  is a recognized line ending sequence, the match position advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre2_match() options below.

       The  choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.


MULTITHREADING

       In  a multithreaded application it is important to keep thread-specific
       data separate from data that can be shared between threads.  The  PCRE2
       library  code  itself  is  thread-safe: it contains no static or global
       variables. The API is designed to be  fairly  simple  for  non-threaded
       applications  while at the same time ensuring that multithreaded appli-
       cations can use it.

       There are several different blocks of data that are used to pass infor-
       mation between the application and the PCRE2 libraries.

       (1) A pointer to the compiled form of a pattern is returned to the user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed,  and  does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by  more
       than one thread simultaneously. An application can compile all its pat-
       terns at the start, before forking off multiple threads that use  them.
       However,  if  the  just-in-time  optimization feature is being used, it
       needs separate memory stack areas for each  thread.  See  the  pcre2jit
       documentation for more details.

       (2)  The  next section below introduces the idea of "contexts" in which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a  PCRE2  function without using lots of arguments. The parameters that
       are stored in contexts are in some sense  "advanced  features"  of  the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are val-
       ues that are never changed, the same context can be  used  by  all  the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

       (3) The matching functions need a block of memory for working space and
       for  storing  the results of a match. This includes details of what was
       matched, as well as additional  information  such  as  the  name  of  a
       (*MARK)  setting. Each thread must provide its own version of this mem-
       ory.


PCRE2 CONTEXTS

       Some PCRE2 functions have a lot of parameters, many of which  are  used
       only  by  specialist  applications,  for example, those that use custom
       memory management or non-standard character tables.  To  keep  function
       argument  lists  at a reasonable size, and at the same time to keep the
       API extensible, "uncommon" parameters are passed to  certain  functions
       in  a  context instead of directly. A context is just a block of memory
       that holds the parameter values.  Applications  that  do  not  need  to
       adjust  any  of  the  context  parameters  can pass NULL when a context
       pointer is required.

       There are three different types of context: a general context  that  is
       relevant  for  several  PCRE2 operations, a compile-time context, and a
       match-time context.

   The general context

       At present, this context just  contains  pointers  to  (and  data  for)
       external  memory  management  functions  that  are  called from several
       places in the PCRE2 library. The context is named `general' rather than
       specifically  `memory'  because in future other fields may be added. If
       you do not want to supply your own custom memory management  functions,
       you  do not need to bother with a general context. A general context is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The two function pointers specify custom memory  management  functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function  may be NULL, in which case the system memory management func-
       tions malloc() and free() are used. (This is not currently  useful,  as
       there  are  no  other  fields in a general context, but in future there
       might be.)  The private_malloc() function  is  used  (if  supplied)  to
       obtain  memory  for storing the context, and all three values are saved
       as part of the context.

       Whenever PCRE2 creates a data block of any kind, the block  contains  a
       pointer  to the free() function that matches the malloc() function that
       was used. When the time comes to  free  the  block,  this  function  is
       called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);


   The compile context

       A  compile context is required if you want to change the default values
       of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         An external function for stack checking

       A compile context is also required if you are using custom memory  man-
       agement.   If  none of these apply, just pass NULL as the context argu-
       ment of pcre2_compile().

       A compile context is created, copied, and freed by the following  func-
       tions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A  compile  context  is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
       Unicode line ending sequence. The value is used by the JIT compiler and
       by  the  two  interpreted   matching   functions,   pcre2_match()   and
       pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       The  value  must  be  the result of a call to pcre2_maketables(), whose
       only argument is a general context. This function builds a set of char-
       acter tables in the current locale.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be recog-
       nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage
       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
       two-character sequence CR followed by LF),  PCRE2_NEWLINE_ANYCRLF  (any
       of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence).

       When a pattern is compiled with the PCRE2_EXTENDED option, the value of
       this parameter affects the recognition of white space and  the  end  of
       internal comments starting with #. The value is saved with the compiled
       pattern for subsequent use by the JIT compiler and by  the  two  inter-
       preted matching functions, pcre2_match() and pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This parameter adjusts the limit, set when PCRE2 is built (default 250),
       on the depth of parenthesis nesting in  a  pattern.  This  limit  stops
       rogue patterns using up too much system stack when being compiled.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There  is at least one application that runs PCRE2 in threads with very
       limited system stack, where running out of stack is to  be  avoided  at
       all  costs. The parenthesis limit above cannot take account of how much
       stack is actually available. For a finer  control,  you  can  supply  a
       function  that  is  called whenever pcre2_compile() starts to compile a
       parenthesized part of a pattern. This function  can  check  the  actual
       stack size (or anything else that it wants to, of course).

       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to change the default values of
       any of the following match-time parameters:

         A callout function
         The limit for calling match()
         The limit for calling match() recursively

       A match context is also required if you are using custom memory manage-
       ment.  If none of these apply, just pass NULL as the  context  argument
       of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A  match  context  is created, copied, and freed by the following func-
       tions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A match context is created with  default  values  for  its  parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This sets up a "callout" function, which PCRE2 will call  at  specified
       points during a matching operation. Details are given in the pcre2call-
       out documentation.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means  of  preventing  PCRE2  from
       using up too many resources when processing patterns that are not going
       to match, but which have a very large number of possibilities in  their
       search  trees. The classic example is a pattern that uses nested unlim-
       ited repeats.

       Internally, pcre2_match() uses a  function  called  match(),  which  it
       calls  repeatedly (sometimes recursively). The limit set by match_limit
       is imposed on the number of times this  function  is  called  during  a
       match, which has the effect of limiting the amount of backtracking that
       can take place. For patterns that are not anchored, the count  restarts
       from  zero  for  each position in the subject string. This limit is not
       relevant to pcre2_dfa_match(), which ignores it.

       When pcre2_match() is called with a pattern that was successfully  pro-
       cessed by pcre2_jit_compile(), the way in which matching is executed is
       entirely different. However, there is still the possibility of  runaway
       matching  that  goes  on  for  a very long time, and so the match_limit
       value is also used in this case (but in a different way) to  limit  how
       long the matching can continue.

       The  default  value  for  the limit can be set when PCRE2 is built; the
       built-in default is 10 million, which handles all but the  most  extreme
       cases.    If    the    limit   is   exceeded,   pcre2_match()   returns
       PCRE2_ERROR_MATCHLIMIT. A value for the match limit may  also  be  sup-
       plied by an item at the start of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where  ddd  is  a  decimal  number.  However, such a setting is ignored
       unless ddd is less than the limit set by the  caller  of  pcre2_match()
       or, if no such limit is set, less than the default.

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The recursion_limit parameter is similar to match_limit, but instead of
       limiting the total number of times that match() is  called,  it  limits
       the  depth  of  recursion. The recursion depth is a smaller number than
       the total number of calls, because not all calls to match() are  recur-
       sive.  This limit is of use only if it is set smaller than match_limit.

       Limiting the recursion depth limits the amount of system stack that can
       be used, or, when PCRE2 has been compiled to use  memory  on  the  heap
       instead  of the stack, the amount of heap memory that can be used. This
       limit is not relevant, and is ignored, when matching is done using  JIT
       compiled code or by the pcre2_dfa_match() function.

       The default value for recursion_limit can be set when PCRE2 is built;
       the built-in default is the same value as the default for match_limit.
       If the limit is exceeded, pcre2_match() returns
       PCRE2_ERROR_RECURSIONLIMIT. A value for the recursion limit may also be
       supplied by an item at the start of a pattern of the form

         (*LIMIT_RECURSION=ddd)

       where  ddd  is  a  decimal  number.  However, such a setting is ignored
       unless ddd is less than the limit set by the  caller  of  pcre2_match()
       or, if no such limit is set, less than the default.

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       This function sets up two additional custom memory management functions
       for use by pcre2_match() when PCRE2 is compiled to  use  the  heap  for
       remembering backtracking data, instead of recursive function calls that
       use the system stack. There is a discussion about PCRE2's  stack  usage
       in  the  pcre2stack documentation. See the pcre2build documentation for
       details of how to build PCRE2.

       Using the heap for recursion is a non-standard way of  building  PCRE2,
       for  use  in  environments  that  have  limited  stacks. Because of the
       greater use of memory management, pcre2_match() runs more slowly. Func-
       tions  that  are  different  to the general custom memory functions are
       provided so that special-purpose external code can  be  used  for  this
       case,  because  the memory blocks are all the same size. The blocks are
       retained by pcre2_match() until it is about to exit so that they can be
       re-used  when  possible during the match. In the absence of these func-
       tions, the normal custom memory management functions are used, if  sup-
       plied, otherwise the system functions.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The  function  pcre2_config()  makes  it possible for a PCRE2 client to
       discover which optional features have  been  compiled  into  the  PCRE2
       library.  The  pcre2build  documentation  has  more details about these
       optional features.

       The first argument for pcre2_config() specifies  which  information  is
       required.  The  second  argument  is a pointer to memory into which the
       information is placed. If NULL is  passed,  the  function  returns  the
       amount  of  memory  that  is  needed for the requested information. For
       calls that return  numerical  values,  the  value  is  in  bytes;  when
       requesting  these  values,  where should point to appropriately aligned
       memory. For calls that return strings, the required length is given  in
       code units, not counting the terminating zero.

       When requesting information, the returned value from pcre2_config() is
       non-negative on success, or the negative error code
       PCRE2_ERROR_BADOPTION if the value in the first argument is not
       recognized. The following information is available:

         PCRE2_CONFIG_BSR

       The output is a uint32_t integer whose value indicates  what  character
       sequences  the  \R  escape  sequence  matches  by  default.  A value of
       PCRE2_BSR_UNICODE  means  that  \R  matches  any  Unicode  line  ending
       sequence;  a  value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
       LF, or CRLF. The default can be overridden when a pattern is compiled.

         PCRE2_CONFIG_JIT

       The output is a uint32_t integer that is set  to  one  if  support  for
       just-in-time compiling is available; otherwise it is set to zero.

         PCRE2_CONFIG_JITTARGET

       The  where  argument  should point to a buffer that is at least 48 code
       units long.  (The  exact  length  required  can  be  found  by  calling
       pcre2_config()  with  where  set  to NULL.) The buffer is filled with a
       string that contains the name of the architecture  for  which  the  JIT
       compiler  is  configured,  for  example  "x86  32bit  (little  endian +
       unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION  is
       returned,  otherwise the number of code units used is returned. This is
       the length of the string, plus one unit for the terminating zero.

         PCRE2_CONFIG_LINKSIZE

       The output is a uint32_t integer that contains the number of bytes used
       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
       configured, the value can be set to 2, 3, or 4, with the default  being
       2.  This is the value that is returned by pcre2_config(). However, when
       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
       when  the  32-bit  library  is compiled, internal linkages always use 4
       bytes, so the configured value is not relevant.

       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
       for  all but the most massive patterns, since it allows the size of the
       compiled pattern to be up to 64K code units. Larger values allow larger
       regular  expressions  to be compiled by those two libraries, but at the
       expense of slower matching.

         PCRE2_CONFIG_MATCHLIMIT

       The output is a uint32_t integer that gives the default limit  for  the
       number  of  internal  matching function calls in a pcre2_match() execu-
       tion. Further details are given with pcre2_match() below.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer  whose  value  specifies  the  default
       character  sequence that is recognized as meaning "newline". The values
       are:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF

       The default should normally correspond to  the  standard  sequence  for
       your operating system.

         PCRE2_CONFIG_PARENSLIMIT

       The  output is a uint32_t integer that gives the maximum depth of nest-
       ing of parentheses (of any kind) in a pattern. This limit is imposed to
       cap  the  amount of system stack used when a pattern is compiled. It is
       specified when PCRE2 is built; the default is 250. This limit does  not
       take  into  account  the  stack that may already be used by the calling
       application. For  finer  control  over  compilation  stack  usage,  see
       pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_RECURSIONLIMIT

       The  output  is a uint32_t integer that gives the default limit for the
       depth of recursion when calling the internal  matching  function  in  a
       pcre2_match()  execution.  Further details are given with pcre2_match()
       below.

         PCRE2_CONFIG_STACKRECURSE

       The output is a uint32_t integer that is set to one if internal  recur-
       sion  when  running  pcre2_match() is implemented by recursive function
       calls that use the system stack to remember their state.  This  is  the
       usual  way that PCRE2 is compiled. The output is zero if PCRE2 was com-
       piled to use blocks of data on the heap instead of  recursive  function
       calls.

         PCRE2_CONFIG_UNICODE_VERSION

       The  where  argument  should point to a buffer that is at least 24 code
       units long.  (The  exact  length  required  can  be  found  by  calling
       pcre2_config()  with  where  set  to  NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with  the  text  "Unicode
       not  supported".  Otherwise,  the  Unicode version string (for example,
       "7.0.0") is inserted. The number of code units used is  returned.  This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The  output is a uint32_t integer that is set to one if Unicode support
       is available; otherwise it is set to zero. Unicode support implies  UTF
       support.

         PCRE2_CONFIG_VERSION

       The  where  argument  should point to a buffer that is at least 12 code
       units long.  (The  exact  length  required  can  be  found  by  calling
       pcre2_config()  with  where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used is
       returned. This is the length of the string plus one unit for the termi-
       nating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal  form.
       The  pattern  is  defined  by a pointer to a string of code units and a
       length. If the pattern is zero-terminated, the length can be  specified
       as  PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
       memory that contains the compiled pattern and related data. The  caller
       must  free the memory by calling pcre2_code_free() when it is no longer
       needed.

       NOTE: When one of the matching functions is  called,  pointers  to  the
       compiled pattern and the subject string are set in the match data block
       so that they can be referenced by the extraction functions. After  run-
       ning  a  match,  you  must  not  free  a compiled pattern (or a subject
       string) until after all operations on the match data block  have  taken
       place.

       If  the  compile context argument ccontext is NULL, memory for the com-
       piled pattern  is  obtained  by  calling  malloc().  Otherwise,  it  is
       obtained  from  the  same memory function that was used for the compile
       context.

       The options argument contains various bit settings that affect the com-
       pilation.  It  should be zero if no options are required. The available
       options are described below. Some of them (in  particular,  those  that
       are  compatible with Perl, but some others as well) can also be set and
       unset from within the pattern (see  the  detailed  description  in  the
       pcre2pattern documentation).

       For  those options that can be different in different parts of the pat-
       tern, the contents of the options argument specifies their settings  at
       the  start  of  compilation.  The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
       options can be set at the time of matching as well as at compile time.

       Other, less frequently required compile-time parameters  (for  example,
       the newline setting) can be provided in a compile context (as described
       above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
       diately.  Otherwise, if compilation of a pattern fails, pcre2_compile()
       returns NULL, having set these variables to an error code and an offset
       (number   of   code   units)  within  the  pattern,  respectively.  The
       pcre2_get_error_message() function provides a textual message for  each
       error code. Compilation errors are positive numbers, but UTF formatting
       errors are negative numbers. For an invalid UTF-8 or UTF-16 string, the
       offset is that of the first code unit of the failing character.

       Some  errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length  of  the  pattern.
       Note  that  the  offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
       acter.

       This code fragment shows a typical straightforward call to
       pcre2_compile():

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           (PCRE2_SPTR)"^A.*Z",    /* the pattern */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */

       The following names for option bits are defined in the  pcre2.h  header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the  string
       that  is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is  the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By  default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data  character  for
       the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the
       class, which therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of three  escape  sequences,
       which  makes  PCRE2's  behaviour more like ECMAScript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a com-
       pile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal  number  defines  the
       code  point  to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by  two
       hexadecimal  digits,  in  which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a  hexadecimal  number  is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

         PCRE2_AUTO_CALLOUT

       If this bit  is  set,  pcre2_compile()  automatically  inserts  callout
       items, all with number 255, before each pattern item. For discussion of
       the callout facility, see the pcre2callout documentation.

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper  and  lower
       case  letters in the subject. It is equivalent to Perl's /i option, and
       it can be changed within a pattern by a (?i) option setting.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches  only
       at  the  end  of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but  not
       before  any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set. There is no equivalent  to  this  option  in
       Perl, and no way to set it within a pattern.

         PCRE2_DOTALL

       If  this  bit  is  set,  a dot metacharacter in the pattern matches any
       character, including one that indicates a  newline.  However,  it  only
       ever matches one character, even if newlines are coded as CRLF. Without
       this option, a dot does not match when the current position in the sub-
       ject  is  at  a newline. This option is equivalent to Perl's /s option,
       and it can be changed within a pattern by a (?s) option setting. A neg-
       ative class such as [^a] always matches newline characters, independent
       of the setting of this option.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capturing  subpatterns  need
       not be unique. This can be helpful for certain types of pattern when it
       is known that only one instance of the named  subpattern  can  ever  be
       matched.  There  are  more details of named subpatterns below; see also
       the pcre2pattern documentation.

         PCRE2_EXTENDED

       If this bit is set, most white space  characters  in  the  pattern  are
       totally  ignored  except when escaped or inside a character class. How-
       ever, white space is not allowed within  sequences  such  as  (?>  that
       introduce various parenthesized subpatterns, nor within numerical quan-
       tifiers such as {1,3}.  Ignorable white space is permitted  between  an
       item  and a following quantifier and between a quantifier and a follow-
       ing + that indicates possessiveness.

       PCRE2_EXTENDED also causes characters between an unescaped # outside  a
       character  class  and the next newline, inclusive, to be ignored, which
       makes it possible to include comments inside complicated patterns. Note
       that  the  end of this type of comment is a literal newline sequence in
       the pattern; escape sequences that happen to represent a newline do not
       count.  PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
       changed within a pattern by a (?x) option setting.

       Which characters are interpreted as newlines can be specified by a set-
       ting  in  the compile context that is passed to pcre2_compile() or by a
       special sequence at the start of the pattern, as described in the  sec-
       tion  entitled "Newline conventions" in the pcre2pattern documentation.
       A default is defined when PCRE2 is built.

         PCRE2_FIRSTLINE

       If this option is set, an  unanchored  pattern  is  required  to  match
       before  or  at  the  first  newline  in  the subject string, though the
       matched text may continue over the newline.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set, a back reference to an  unset  subpattern  group
       matches  an  empty  string (by default this causes the current matching
       alternative to fail).  A pattern such as  (\1)(a)  succeeds  when  this
       option  is set (assuming it can find an "a" in the subject), whereas it
       fails by default, for Perl compatibility.  Setting  this  option  makes
       PCRE2 behave more like ECMAScript (aka JavaScript).

         PCRE2_MULTILINE

       By  default,  for  the purposes of matching "start of line" and "end of
       line", PCRE2 treats the subject string as consisting of a  single  line
       of  characters,  even  if  it actually contains newlines. The "start of
       line" metacharacter (^) matches only at the start of  the  string,  and
       the  "end  of  line"  metacharacter  ($) matches only at the end of the
       string, or before a terminating newline (except when
       PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL
       is set, the "any character" metacharacter (.) does not match at a
       newline. This behaviour (for ^, $, and dot) is the same as Perl.

       When PCRE2_MULTILINE is set, the "start of line" and  "end  of  line"
       constructs match immediately following or immediately  before  internal
       newlines  in  the  subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and  it  can  be
       changed within a pattern by a (?m) option setting. If there are no new-
       lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
       setting PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_UCP

       This  option  locks  out the use of Unicode properties for handling \B,
       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
       described  for  the  PCRE2_UCP option below. In particular, it prevents
       the creator of the pattern from enabling this facility by starting  the
       pattern  with  (*UCP).  This may be useful in applications that process
       patterns from external sources. The option  combination  PCRE2_UCP  and
       PCRE2_NEVER_UCP causes an error.

         PCRE2_NEVER_UTF

       This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
       or UTF-32, depending on which library is in use. In particular, it pre-
       vents  the  creator of the pattern from switching to UTF interpretation
       by starting the pattern with (*UTF). This may be useful in applications
       that  process  patterns  from  external  sources.  The  combination  of
       PCRE2_UTF and PCRE2_NEVER_UTF causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing paren-
       theses  in the pattern. Any opening parenthesis that is not followed by
       ? behaves as if it were followed by ?: but named parentheses can  still
       be  used  for  capturing  (and  they acquire numbers in the usual way).
       There is no equivalent of this option in Perl.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which is an
       optimization  that,  for example, turns a+b into a++b in order to avoid
       backtracks into a+ that can never be successful. However,  if  callouts
       are  in  use,  auto-possessification means that some callouts are never
       taken. You can set this option if you want the matching functions to do
       a  full  unoptimized  search and run all the callouts, but it is mainly
       provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied when
       .*  is  the  first significant item in a top-level branch of a pattern,
       and all the other branches also start with .* or with \A or  \G  or  ^.
       The  optimization  is  automatically disabled for .* if it is inside an
       atomic group or a capturing group that is the subject of a back  refer-
       ence,  or  if  the pattern contains (*PRUNE) or (*SKIP). When the opti-
       mization is not disabled, such a pattern is automatically  anchored  if
       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
       for any ^ items. Otherwise, the fact that any match must  start  either
       at  the start of the subject or following a newline is remembered. Like
       other optimizations, this can cause callouts to be skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time.  It  does  not
       change what pcre2_compile() generates, but it does affect the output of
       the JIT compiler.

       There are a number of optimizations that may occur at the  start  of  a
       match,  in  order  to speed up the process. For example, if it is known
       that an unanchored match must start  with  a  specific  character,  the
       matching  code searches the subject for that character, and fails imme-
       diately if it cannot find it, without actually running the main  match-
       ing  function.  This means that a special item such as (*COMMIT) at the
       start of a pattern is not considered until after  a  suitable  starting
       point  for  the  match  has  been found. Also, when callouts or (*MARK)
       items are in use, these "start-up" optimizations can cause them  to  be
       skipped  if  the pattern is never actually used. The start-up optimiza-
       tions are in effect a pre-scan of the subject that takes  place  before
       the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer,  but  ensuring  that  in  cases
       where  the  result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string.

       Setting  PCRE2_NO_START_OPTIMIZE  may  change the outcome of a matching
       operation.  Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE2 records the fact that a match  must  start
       with  the  character  "A".  Suppose the subject string is "DEFABC". The
       start-up optimization scans along the subject, finds "A" and  runs  the
       first  match attempt from there. The (*COMMIT) item means that the pat-
       tern must match at the current starting position, which, in this case,
       it does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
       set, the initial scan along the subject string  does  not  happen.  The
       first  match  attempt  is  run  starting  from "D" and when this fails,
       (*COMMIT) prevents any further matches  being  tried,  so  the  overall
       result is "no match". There are also other start-up optimizations.  For
       example, a minimum length for the subject may be recorded. Consider the
       pattern

         (*MARK:A)(X|Y)

       The  minimum  length  for  a  match is one character. If the subject is
       "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
       to match an empty string at the end of the subject does not take place,
       because PCRE2 knows that the subject is  now  too  short,  and  so  the
       (*MARK)  is  never encountered. In this case, the optimization does not
       affect the overall match result, which is still "no match", but it does
       affect the auxiliary information that is returned.

         PCRE2_NO_UTF_CHECK

       When  PCRE2_UTF  is set, the validity of the pattern as a UTF string is
       automatically checked. There are  discussions  about  the  validity  of
       UTF-8  strings,  UTF-16 strings, and UTF-32 strings in the pcre2unicode
       document.  If an invalid UTF sequence is found, pcre2_compile() returns
       a negative error code.

       If you know that your pattern is valid, and you want to skip this check
       for performance reasons, you can  set  the  PCRE2_NO_UTF_CHECK  option.
       When  it  is set, the effect of passing an invalid UTF string as a pat-
       tern is undefined. It may cause your program to  crash  or  loop.  Note
       that   this   option   can   also   be   passed  to  pcre2_match()  and
       pcre2_dfa_match(), to suppress validity checking of the subject string.

         PCRE2_UCP

       This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
       \w,  and  some  of  the POSIX character classes. By default, only ASCII
       characters are recognized, but if PCRE2_UCP is set, Unicode  properties
       are  used instead to classify characters. More details are given in the
       section on generic character types in the pcre2pattern page. If you set
       PCRE2_UCP,  matching one of the items it affects takes much longer. The
       option is available only if PCRE2 has been compiled with  Unicode  sup-
       port.

         PCRE2_UNGREEDY

       This  option  inverts  the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It  is
       not  compatible  with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE2_UTF

       This option causes PCRE2 to regard both the  pattern  and  the  subject
       strings  that  are  subsequently processed as strings of UTF characters
       instead of single-code-unit strings. It  is  available  when  PCRE2  is
       built  to  include  Unicode  support (which is the default). If Unicode
       support is not available, the use of this  option  provokes  an  error.
       Details  of how this option changes the behaviour of PCRE2 are given in
       the pcre2unicode page.


COMPILATION ERROR CODES

       There are over 80 positive error codes that pcre2_compile() may  return
       if it finds an error in the pattern. There are also some negative error
       codes that are used for invalid UTF strings.  These  are  the  same  as
       given  by pcre2_match() and pcre2_dfa_match(), and are described in the
       pcre2unicode page. The pcre2_get_error_message() function can be called
       to obtain a textual error message from any error code.


JUST-IN-TIME (JIT) COMPILATION

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);

       These  functions  provide  support  for  JIT compilation, which, if the
       just-in-time compiler is available, further processes a  compiled  pat-
       tern into machine code that executes much faster than the pcre2_match()
       interpretive matching function. Full details are given in the  pcre2jit
       documentation.

       JIT  compilation  is  a heavyweight optimization. It can take some time
       for patterns to be analyzed, and for one-off matches  and  simple  pat-
       terns  the benefit of faster execution might be offset by a much slower
       compilation time. Most, but not all, patterns can be optimized  by  the
       JIT compiler.


LOCALE SUPPORT

       PCRE2  handles caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a set of tables,  indexed
       by  character  code  point.  This applies only to characters whose code
       points are less than 256. By default, higher-valued code  points  never
       match  escapes  such  as \w or \d.  However, if PCRE2 is built with UTF
       support, all characters can be tested with  \p  and  \P,  or,  alterna-
       tively,  the  PCRE2_UCP  option  can be set when a pattern is compiled;
       this causes \w and friends to use Unicode property support  instead  of
       the built-in tables.

       The  use  of  locales  with Unicode is discouraged. If you are handling
       characters with code points greater than 128,  you  should  either  use
       Unicode support, or use locales, but not try to mix the two.

       PCRE2  contains  an  internal  set of character tables that are used by
       default.  These are sufficient for  many  applications.  Normally,  the
       internal tables recognize only ASCII characters. However, when PCRE2 is
       built, it is possible to cause the internal tables to be rebuilt in the
       default "C" locale of the local system, which may cause them to be dif-
       ferent.

       The internal tables can be overridden by tables supplied by the  appli-
       cation  that  calls  PCRE2.  These may be created in a different locale
       from the default.  As more and more applications change to  using  Uni-
       code, the need for this locale support is expected to die away.

       External  tables  are built by calling the pcre2_maketables() function,
       in the relevant locale. The result can be passed to pcre2_compile()  as
       often   as  necessary,  by  creating  a  compile  context  and  calling
       pcre2_set_character_tables() to set the  tables  pointer  therein.  For
       example,  to  build  and use tables that are appropriate for the French
       locale (where accented characters with  values  greater  than  128  are
       treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);

       The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
       if you are using Windows, the name for the French locale  is  "french".
       It  is the caller's responsibility to ensure that the memory containing
       the tables remains available for as long as it is needed.

       The pointer that is passed (via the compile context) to pcre2_compile()
       is  saved  with  the  compiled pattern, and the same tables are used by
       pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, com-
       pilation and matching both happen in the same  locale,  but  different
       patterns can be processed in different locales.


INFORMATION ABOUT A COMPILED PATTERN

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       The pcre2_pattern_info() function returns information about a  compiled
       pattern.  The  first argument is a pointer to the compiled pattern. The
       second argument specifies which piece of information is  required,  and
       the  third  argument is a pointer to a variable to receive the data. If
       the third argument is NULL, the first  argument  is  ignored,  and  the
       function returns the size in bytes of the variable that is required for
       the information requested.  Otherwise, the yield  of  the  function  is
       zero for success, or one of the following negative numbers:

         PCRE2_ERROR_NULL           the argument code was NULL
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
         PCRE2_ERROR_BADOPTION      the value of what was invalid
         PCRE2_ERROR_UNSET          the requested field is not set

       The  "magic  number" is placed at the start of each compiled pattern as
       a simple check against passing an arbitrary memory pointer. Here is  a
       typical  call of pcre2_pattern_info(), to obtain the length of the com-
       piled pattern:

         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */

       The possible values for the second argument are defined in pcre2.h, and
       are as follows:

         PCRE2_INFO_ALLOPTIONS
         PCRE2_INFO_ARGOPTIONS

       Return a copy of the pattern's options. The third argument should point
       to a  uint32_t  variable.  PCRE2_INFO_ARGOPTIONS  returns  exactly  the
       options  that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
       TIONS returns the compile options as modified by any  top-level  option
       settings  at  the start of the pattern itself. In other words, they are
       the options that will be in force when matching starts. For example, if
       the  pattern  /(?im)abc(?-i)d/  is  compiled  with  the  PCRE2_EXTENDED
       option,   the   result   is   PCRE2_CASELESS,   PCRE2_MULTILINE,    and
       PCRE2_EXTENDED.

       A  pattern compiled without PCRE2_ANCHORED is automatically anchored by
       PCRE2 if the first significant item in every top-level branch is one of
       the following:

         ^     unless PCRE2_MULTILINE is set
         \A    always
         \G    always
         .*    sometimes - see below

       When  .* is the first significant item, anchoring is possible only when
       all the following are true:

         .* is not in an atomic group
         .* is not in a capturing group that is the subject
              of a back reference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern.
         PCRE2_NO_DOTSTAR_ANCHOR is not set.

       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is  set  in
       the options returned for PCRE2_INFO_ALLOPTIONS.

         PCRE2_INFO_BACKREFMAX

       Return  the  number  of  the highest back reference in the pattern. The
       third argument should point to a uint32_t variable. Zero  is  returned
       if there are no back references.

         PCRE2_INFO_BSR

       The output is a uint32_t whose value indicates what character sequences
       the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
       \R  matches any Unicode line ending sequence; a value of PCRE2_BSR_ANY-
       CRLF means that \R matches only CR, LF, or CRLF.

         PCRE2_INFO_CAPTURECOUNT

       Return the number of capturing subpatterns in the  pattern.  The  third
       argument should point to a uint32_t variable.

         PCRE2_INFO_FIRSTCODETYPE

       Return information about the first code unit of any matched string, for
       a non-anchored pattern. The third argument should point to  a  uint32_t
       variable.

       If  there  is  a  fixed first value, for example, the letter "c" from a
       pattern such as (cat|cow|coyote), 1  is  returned,  and  the  character
       value  can  be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
       fixed first value, but it is known that a match can occur only  at  the
       start  of  the  subject  or  following  a  newline in the subject, 2 is
       returned. Otherwise, and for anchored patterns, 0 is returned.

         PCRE2_INFO_FIRSTCODEUNIT

       Return the value of the first code unit of any matched  string  in  the
       situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
       The third argument should point to a uint32_t variable. In  the  8-bit
       library,  the  value is always less than 256. In the 16-bit library the
       value can be up to 0xffff. In the 32-bit library  in  UTF-32  mode  the
       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
       mode.

         PCRE2_INFO_FIRSTBITMAP

       In the absence of a single first code unit for a non-anchored  pattern,
       pcre2_compile()  may construct a 256-bit table that defines a fixed set
       of values for the first code unit in any match. For example, a  pattern
       that  starts  with  [abc]  results in a table with three bits set. When
       code unit values greater than 255 are supported, the flag bit  for  255
       means  "any  code unit of value 255 or above". If such a table was con-
       structed, a pointer to it is returned. Otherwise NULL is returned.  The
       third argument should point to a const uint8_t * variable.

         PCRE2_INFO_HASCRORLF

       Return  1  if  the  pattern  contains any explicit matches for CR or LF
       characters, otherwise 0. The third argument should point to a  uint32_t
       variable.  An explicit match is either a literal CR or LF character, or
       \r or \n.

         PCRE2_INFO_JCHANGED

       Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
       otherwise  0.  The third argument should point to a uint32_t  variable.
       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES  option,  respec-
       tively.

         PCRE2_INFO_JITSIZE

       If  the  compiled  pattern was successfully processed by pcre2_jit_com-
       pile(), return the size of the  JIT  compiled  code,  otherwise  return
       zero. The third argument should point to a size_t variable.

         PCRE2_INFO_LASTCODETYPE

       Return  1  if there is a rightmost literal code unit that must exist in
       any matched string, other than at its start. The third argument  should
       point  to  a  uint32_t  variable.  If  there  is  no  such  value, 0 is
       returned. When 1 is  returned,  the  code  unit  value  itself  can  be
       retrieved using PCRE2_INFO_LASTCODEUNIT.

       For anchored patterns, a last literal value is recorded only if it fol-
       lows something  of  variable  length.  For  example,  for  the  pattern
       /^a\d+z\d+/   the   returned   value  is  1  (with  "z"  returned  from
       PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.

         PCRE2_INFO_LASTCODEUNIT

       Return the value of the rightmost literal code unit that must exist  in
       any  matched  string, other than at its start, if such a value has been
       recorded. The third argument should point to a  uint32_t  variable.  If
       there is no such value, 0 is returned.

         PCRE2_INFO_MATCHEMPTY

       Return  1  if  the  pattern can match an empty string, otherwise 0. The
       third argument should point to a uint32_t variable.

         PCRE2_INFO_MATCHLIMIT

       If the pattern set a match limit by  including  an  item  of  the  form
       (*LIMIT_MATCH=nnnn)  at  the  start,  the  value is returned. The third
       argument should point to an unsigned 32-bit integer. If no  such  value
       has  been  set,  the  call  to  pcre2_pattern_info()  returns the error
       PCRE2_ERROR_UNSET.

         PCRE2_INFO_MAXLOOKBEHIND

       Return the number of characters (not code units) in the longest lookbe-
       hind  assertion  in  the pattern. The third argument should point to an
       unsigned 32-bit integer. This information is useful when  doing  multi-
       segment  matching  using the partial matching facilities. Note that the
       simple assertions \b and \B require a one-character lookbehind. \A also
       registers  a  one-character  lookbehind,  though  it  does not actually
       inspect the previous character. This is to ensure  that  at  least  one
       character  from  the old segment is retained when a new segment is pro-
       cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
       match incorrectly at the start of a new segment.

         PCRE2_INFO_MINLENGTH

       If  a  minimum  length  for  matching subject strings was computed, its
       value is returned. Otherwise the returned value is 0. The  value  is  a
       number  of characters, which in UTF mode may be different from the num-
       ber of code units.  The third  argument  should  point  to  a  uint32_t
       variable.  The  value  is  a  lower bound to the length of any matching
       string. There may not be any strings of that length  that  do  actually
       match, but every string that does match is at least that long.

         PCRE2_INFO_NAMECOUNT
         PCRE2_INFO_NAMEENTRYSIZE
         PCRE2_INFO_NAMETABLE

       PCRE2 supports the use of named as well as numbered capturing parenthe-
       ses. The names are just an additional way of identifying the  parenthe-
       ses, which still acquire numbers. Several convenience functions such as
       pcre2_substring_get_byname() are provided for extracting captured  sub-
       strings  by  name. It is also possible to extract the data directly, by
       first converting the name to a number in order to  access  the  correct
       pointers  in the output vector (described with pcre2_match() below). To
       do the conversion, you need to use the  name-to-number  map,  which  is
       described by these three values.

       The  map  consists  of a number of fixed-size entries. PCRE2_INFO_NAME-
       COUNT gives the number of entries, and  PCRE2_INFO_NAMEENTRYSIZE  gives
       the  size  of each entry in code units; both of these return a uint32_t
       value. The entry size depends on the length of the longest name.

       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
       This  is  a  PCRE2_SPTR  pointer to a block of code units. In the 8-bit
       library, the first two bytes of each entry are the number of  the  cap-
       turing parenthesis, most significant byte first. In the 16-bit library,
       the pointer points to 16-bit code units, the first  of  which  contains
       the  parenthesis  number.  In the 32-bit library, the pointer points to
       32-bit code units, the first of which contains the parenthesis  number.
       The rest of the entry is the corresponding name, zero terminated.

       The  names are in alphabetical order. If (?| is used to create multiple
       groups with the same number, as described in the section  on  duplicate
       subpattern  numbers  in  the pcre2pattern page, the groups may be given
       the same name, but there is only one  entry  in  the  table.  Different
       names for groups of the same number are not permitted.

       Duplicate  names  for subpatterns with different numbers are permitted,
       but only if PCRE2_DUPNAMES is set. They appear  in  the  table  in  the
       order  in  which  they were found in the pattern. In the absence of (?|
       this is the order of increasing number; when (?| is used  this  is  not
       necessarily the case because later subpatterns may have lower numbers.

       As  a  simple  example of the name/number table, consider the following
       pattern after compilation by the 8-bit library  (assume  PCRE2_EXTENDED
       is set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There  are  four  named subpatterns, so the table has four entries, and
       each entry in the table is eight bytes long. The table is  as  follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data  from  named  subpatterns  using  the
       name-to-number  map,  remember that the length of the entries is likely
       to be different for each compiled pattern.
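
       Walking the name-to-number map by hand can be sketched in plain C for
       the 8-bit layout described above. The function group_number_8bit() and
       the array example_table are invented for the illustration; in a real
       program the table pointer, entry count, and entry size would come from
       PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and
       PCRE2_INFO_NAMEENTRYSIZE. The undefined bytes of the example table are
       filled with zeros here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The table from the date example above: four entries of eight bytes. */
const unsigned char example_table[] = {
  0, 1, 'd', 'a', 't', 'e', 0,   0,
  0, 5, 'd', 'a', 'y', 0,   0,   0,
  0, 4, 'm', 'o', 'n', 't', 'h', 0,
  0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Look up name in an 8-bit name table with count entries, each of
   entry_size bytes. Returns the group number, or -1 if the name is not
   in the table. */
int group_number_8bit(const unsigned char *table, unsigned int count,
  unsigned int entry_size, const char *name)
{
  /* Two bytes of group number plus at least a terminating zero. */
  assert(entry_size > 2);

  for (unsigned int i = 0; i < count; i++)
  {
    const unsigned char *entry = table + (size_t)i * entry_size;
    /* The zero-terminated name starts after the two number bytes. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];   /* most significant byte first */
  }
  return -1;
}
```

       A linear scan is used for simplicity. The entry size is passed in
       rather than hard-coded because it differs from pattern to pattern.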

         PCRE2_INFO_NEWLINE

       The output is a uint32_t with one of the following values:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF

       This specifies the default character sequence that will  be  recognized
       as meaning "newline" while matching.

         PCRE2_INFO_RECURSIONLIMIT

       If  the  pattern set a recursion limit by including an item of the form
       (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The  third
       argument  should  point to an unsigned 32-bit integer. If no such value
       has been set,  the  call  to  pcre2_pattern_info()  returns  the  error
       PCRE2_ERROR_UNSET.

         PCRE2_INFO_SIZE

       Return  the  size  of  the  compiled  pattern  in  bytes (for all three
       libraries). The third argument should point to a size_t variable.  This
       value  does  not  include  the size of the pcre2_code structure that is
       returned by pcre2_compile(). The value that is used when pcre2_compile()
       is  getting  memory  in  which  to place the compiled data is the value
       returned by this option plus the size of the pcre2_code structure. Pro-
       cessing  a  pattern  with  the  JIT  compiler  does not alter the value
       returned by this option.


THE MATCH DATA BLOCK

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       void pcre2_match_data_free(pcre2_match_data *match_data);

       Information about a successful or unsuccessful match  is  placed  in  a
       match  data  block,  which  is  an opaque structure that is accessed by
       function calls. In particular, the match data block contains  a  vector
       of  offsets into the subject string that define the matched part of the
       subject and any substrings that were captured.  This  is  known as  the
       ovector.

       Before  calling  pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
       you must create a match data block by calling one of the creation func-
       tions  above.  For pcre2_match_data_create(), the first argument is the
       number of pairs of offsets in the  ovector.  One  pair  of  offsets  is
       required  to  identify  the string that matched the whole pattern, with
       another pair for each captured substring. For example,  a  value  of  4
       creates  enough space to record the matched portion of the subject plus
       three captured substrings. A minimum of 1 pair is imposed by
       pcre2_match_data_create(), so it is always possible to return the over-
       all matched string.

       The second argument of pcre2_match_data_create() is a pointer to a gen-
       eral  context, which can specify custom memory management for obtaining
       the memory for the match data block. If you are not using custom memory
       management, pass NULL, which causes malloc() to be used.

       For  pcre2_match_data_create_from_pattern(),  the  first  argument is a
       pointer to a compiled pattern. The ovector is created to be exactly the
       right size to hold all the substrings a pattern might capture. The sec-
       ond argument is again a pointer to a general context, but in this  case
       if NULL is passed, the memory is obtained using the same allocator that
       was used for the compiled pattern (custom or default).

       A match data block can be used many times, with the same  or  different
       compiled  patterns. You can extract information from a match data block
       after  a  match  operation  has  finished,  using  functions  that  are
       described  in  the  sections  on  matched  strings and other match data
       below.

       When a call of pcre2_match() fails, valid  data  is  available  in  the
       match    block    only   when   the   error   is   PCRE2_ERROR_NOMATCH,
       PCRE2_ERROR_PARTIAL, or one of the  error  codes  for  an  invalid  UTF
       string. Exactly what is available depends on the error, and is detailed
       below.

       When one of the matching functions is called, pointers to the  compiled
       pattern  and the subject string are set in the match data block so that
       they can be referenced by the extraction  functions.  After  running  a
       match,  you  must not free a compiled pattern or a subject string until
       after all operations on the match data  block  (for  that  match)  have
       taken place.

       When  a match data block itself is no longer needed, it should be freed
       by calling pcre2_match_data_free().


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       The function pcre2_match() is called to match a subject string  against
       a  compiled pattern, which is passed in the code argument. You can call
       pcre2_match() with the same code argument as many times as you like, in
       order  to  find multiple matches in the subject string or to match dif-
       ferent subject strings with the same pattern.

       This function is the main matching facility  of  the  library,  and  it
       operates  in  a  Perl-like  manner. For specialist use there is also an
       alternative matching function, which is described below in the  section
       about the pcre2_dfa_match() function.

       Here is an example of a simple call to pcre2_match():

         pcre2_match_data *match_data = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,                         /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,                         /* the length of the subject string */
           0,                          /* start at offset 0 in the subject */
           0,                          /* default options */
           match_data,                 /* the match data block */
           NULL);                      /* match context; NULL means defaults */

       If  the  subject  string is zero-terminated, the length can be given as
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
       common matching parameters are to be changed. For details, see the sec-
       tion on the match context above.

   The string to be matched by pcre2_match()

       The subject string is passed to pcre2_match() as a pointer in  subject,
       a  length  in  length, and a starting offset in startoffset. The length
       and offset are in code units, not characters.  That  is,  they  are  in
       bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
       and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
       cessing is enabled.

       If startoffset is greater than the length of the subject, pcre2_match()
       returns PCRE2_ERROR_BADOFFSET. When the starting offset  is  zero,  the
       search  for a match starts at the beginning of the subject, and this is
       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
       set  must  point to the start of a character, or to the end of the sub-
       ject (in UTF-32 mode, one code unit equals one character, so  all  off-
       sets  are  valid).  Like  the  pattern  string, the subject may contain
       binary zeroes.

       A non-zero starting offset is useful when searching for  another  match
       in  the  same  subject  by calling pcre2_match() again after a previous
       success.  Setting startoffset differs from  passing  over  a  shortened
       string  and  setting  PCRE2_NOTBOL in the case of a pattern that begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which finds occurrences of "iss" in the middle of  words.  (\B  matches
       only  if  the  current position in the subject is not a word boundary.)
       When applied to the string "Mississippi" the first call to pcre2_match()
       finds  the first occurrence. If pcre2_match() is called again with just
       the remainder of the subject, namely  "issippi",  it  does  not  match,
       because \B is always false at the start of the subject, which is deemed
       to be a word boundary. However, if pcre2_match() is passed  the  entire
       string again, but with startoffset set to 4, it finds the second occur-
       rence of "iss" because it is able to look behind the starting point  to
       discover that it is preceded by a letter.

       Finding  all  the  matches  in a subject is tricky when the pattern can
       match an empty string. It is possible to emulate Perl's /g behaviour by
       first   trying   the   match   again  at  the  same  offset,  with  the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options,  and  then  if  that
       fails,  advancing  the  starting  offset  and  trying an ordinary match
       again. There is some code that demonstrates  how  to  do  this  in  the
       pcre2demo  sample  program. In the most general case, you have to check
       to see if the newline convention recognizes CRLF as a newline,  and  if
       so,  and the current character is CR followed by LF, advance the start-
       ing offset by two characters instead of one.
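
       The "advance the starting offset" step can be sketched as a small
       helper in plain C for 8-bit subjects. The function name
       advance_past_empty_match() and its two flag parameters are invented
       for the illustration; the caller is assumed to have worked out
       already whether the newline convention recognizes CRLF and whether
       the pattern was compiled with PCRE2_UTF. In UTF-8 mode the helper
       also skips continuation bytes so that the new offset points to the
       start of a character.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset at which to retry after the anchored, non-empty
   retry at offset has failed. The offset must be less than length. */
size_t advance_past_empty_match(const unsigned char *subject, size_t length,
  size_t offset, int crlf_is_newline, int utf8)
{
  assert(offset < length);

  size_t next = offset + 1;

  if (crlf_is_newline && next < length &&
      subject[offset] == '\r' && subject[next] == '\n')
  {
    next += 1;                 /* step over CRLF as a single newline */
  }
  else if (utf8)
  {
    /* UTF-8 continuation bytes have the form 10xxxxxx. */
    while (next < length && (subject[next] & 0xC0) == 0x80) next += 1;
  }
  return next;
}
```

       The caller would then run an ordinary match from the returned offset,
       stopping the loop when the offset reaches the end of the subject.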

       If a non-zero starting offset is passed when the pattern  is  anchored,
       one attempt to match at the given offset is made. This can only succeed
       if the pattern does not require the match to be at  the  start  of  the
       subject.

   Option bits for pcre2_match()

       The unused bits of the options argument for pcre2_match() must be zero.
       The only  bits  that  may  be  set  are  PCRE2_ANCHORED,  PCRE2_NOTBOL,
       PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and  PCRE2_PARTIAL_SOFT.  Their
       action is described below.

       Setting  PCRE2_ANCHORED  at match time is not supported by the just-in-
       time (JIT) compiler. If it is set, JIT matching  is  disabled  and  the
       normal interpretive code in pcre2_match() is run. The remaining options
       are supported for JIT matching.

         PCRE2_ANCHORED

       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
       matching  position.  If  a pattern was compiled with PCRE2_ANCHORED, or
       turned out to be anchored by virtue of its contents, it cannot be  made
       unanchored at matching time. Note that setting the option at match time
       disables JIT matching.

         PCRE2_NOTBOL

       This option specifies that the first character of the subject string is
       not the beginning of a line, so the circumflex metacharacter should not
       match before it. Setting this without having set PCRE2_MULTILINE at
       compile time causes circumflex never to match. This option affects only
       the behaviour of the circumflex metacharacter. It does not affect \A.

         PCRE2_NOTEOL

       This option specifies that the end of the subject string is not the end
       of  a line, so the dollar metacharacter should not match it nor (except
       in multiline mode) a newline immediately before it. Setting this  with-
       out  having  set PCRE2_MULTILINE at compile time causes dollar never to
       match. This option affects only the behaviour of the dollar metacharac-
       ter. It does not affect \Z or \z.

         PCRE2_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set. If there are alternatives in the pattern, they are tried.  If  all
       the  alternatives  match  the empty string, the entire match fails. For
       example, if the pattern

         a?b?

       is applied to a string not beginning with "a" or  "b",  it  matches  an
       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
       match is not valid, so pcre2_match() searches further into  the  string
       for occurrences of "a" or "b".

         PCRE2_NOTEMPTY_ATSTART

       This  is  like PCRE2_NOTEMPTY, except that it locks out an empty string
       match only at the first matching position, that is, at the start of the
       subject  plus  the  starting offset. An empty string match later in the
       subject is permitted.  If the pattern is anchored,  such  a  match  can
       occur only if the pattern contains \K.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set at compile time, the validity of the subject as a
       UTF string is checked by default  when  pcre2_match()  is  subsequently
       called.  The entire string is checked before any other processing takes
       place, and a negative error code is returned if the check fails.  There
       are  several UTF error codes for each code unit width, corresponding to
       different problems with the code unit sequence. The value of  startoff-
       set is also checked, to ensure that it points to the start of a charac-
       ter or to the end of the  subject.  There  are  discussions  about  the
       validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
       pcre2unicode page.

       If you know that your subject is valid, and  you  want  to  skip  these
       checks  for  performance  reasons,  you  can set the PCRE2_NO_UTF_CHECK
       option when calling pcre2_match(). You might want to do  this  for  the
       second and subsequent calls to pcre2_match() if you are making repeated
       calls to find all the matches in a single subject string.

       NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an  invalid
       string  as a subject, or an invalid value of startoffset, is undefined.
       Your program may crash or loop indefinitely.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These options turn on the partial matching  feature.  A  partial  match
       occurs  if  the  end of the subject string is reached successfully, but
       there are not enough subject characters to complete the match. If  this
       happens  when  PCRE2_PARTIAL_SOFT  (but not PCRE2_PARTIAL_HARD) is set,
       matching continues by testing any remaining alternatives.  Only  if  no
       complete  match can be found is PCRE2_ERROR_PARTIAL returned instead of
       PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies  that
       the  caller  is prepared to handle a partial match, but only if no com-
       plete match can be found.

       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In  this
       case,  if  a  partial match is found, pcre2_match() immediately returns
       PCRE2_ERROR_PARTIAL, without considering  any  other  alternatives.  In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
       ered to be more important than an alternative complete match.

       There is a more detailed discussion of partial and multi-segment match-
       ing, with examples, in the pcre2partial documentation.


NEWLINE HANDLING WHEN MATCHING

       When  PCRE2 is built, a default newline convention is set; this is usu-
       ally the standard convention for the operating system. The default  can
       be  overridden  in  a  compile  context.   During matching, the newline
       choice affects  the  behaviour  of  the  dot,  circumflex,  and  dollar
       metacharacters.  It  may also alter the way the match starting position
       is advanced after a match failure for an unanchored pattern.

       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
       set  as  the  newline convention, and a match attempt for an unanchored
       pattern fails when the current starting position is at a CRLF sequence,
       and  the  pattern contains no explicit matches for CR or LF characters,
       the match position is advanced by two characters  instead  of  one,  in
       other words, to after the CRLF.

       The above rule is a compromise that makes the most common cases work as
       expected. For example, if the pattern  is  .+A  (and  the  PCRE2_DOTALL
       option is not set), it does not match the string "\r\nA" because, after
       failing at the start, it skips both the CR and the LF before  retrying.
       However,  the  pattern  [\r\n]A does match that string, because it con-
       tains an explicit CR or LF reference, and so advances only by one char-
       acter after the first failure.

       An explicit match for CR or LF is either a literal appearance of one of
       those characters in the  pattern,  or  one  of  the  \r  or  \n  escape
       sequences.  Implicit  matches  such  as [^X] do not count, nor does \s,
       even though it includes CR and LF in the characters that it matches.
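
       The advancing rule described above can be sketched in C as follows.
       This is an illustration only, not PCRE2's own code. The function
       next_start_offset() and its two flag parameters are hypothetical
       names; the caller is assumed to know already whether the pattern
       contains an explicit CR or LF, and UTF mode (where one character may
       occupy several code units) is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Returns the next starting
   offset after a failed match attempt at `pos` for an unanchored pattern.
   crlf_is_newline: nonzero for the CRLF, ANYCRLF, or ANY conventions.
   pattern_has_explicit_crlf: nonzero if the pattern contains a literal
   CR or LF, or a \r or \n escape. */
size_t next_start_offset(const char *subject, size_t length, size_t pos,
                         int crlf_is_newline, int pattern_has_explicit_crlf)
{
    assert(subject != NULL && pos < length);
    if (crlf_is_newline && !pattern_has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF pair */
    return pos + 1;       /* ordinary advance by one code unit */
}
```

       For the subject "\r\nA" and a failure at offset 0, this sketch yields
       2 when the pattern has no explicit CR or LF (as with .+A) and 1 when
       it has one (as with [\r\n]A).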

       Notwithstanding the above, anomalous effects may still occur when  CRLF
       is a valid newline sequence and explicit \r or \n escapes appear in the
       pattern.


HOW pcre2_match() RETURNS A STRING AND CAPTURED SUBSTRINGS

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       In general, a pattern matches a certain portion of the subject, and  in
       addition,  further  substrings  from  the  subject may be picked out by
       parenthesized parts of the pattern.  Following  the  usage  in  Jeffrey
       Friedl's  book,  this  is  called  "capturing" in what follows, and the
       phrase "capturing subpattern" or "capturing group" is used for a  frag-
       ment  of  a  pattern that picks out a substring. PCRE2 supports several
       other kinds of parenthesized subpattern that do not cause substrings to
       be  captured. The pcre2_pattern_info() function can be used to find out
       how many capturing subpatterns there are in a compiled pattern.

       A successful match returns the overall matched string and any  captured
       substrings  to  the  caller  via a vector of PCRE2_SIZE values. This is
       called the ovector, and is contained within the match data block.   You
       can  obtain  direct  access  to  the ovector by calling pcre2_get_ovec-
       tor_pointer() to find its  address,  and  pcre2_get_ovector_count()  to
       find  the number of pairs of values it contains. Alternatively, you can
       use the auxiliary functions for accessing captured substrings by number
       or by name (see below).

       Within the ovector, the first in each pair of values is set to the off-
       set of the first code unit of a substring, and the second is set to the
       offset  of the first code unit after the end of a substring. These val-
       ues are always code unit offsets, not character offsets. That is,  they
       are  byte  offsets  in  the 8-bit library, 16-bit offsets in the 16-bit
       library, and 32-bit offsets in the 32-bit library.
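
       Because the offsets count code units, an application that needs a
       character position in a UTF-8 subject has to convert. The sketch
       below shows one way of doing so; utf8_character_offset() is a
       hypothetical helper, not a PCRE2 function, and it assumes that the
       subject is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Counts the characters that
   start before `code_unit_offset` in a valid UTF-8 subject. Every byte
   that is not a continuation byte (10xxxxxx) starts a character. */
size_t utf8_character_offset(const unsigned char *subject,
                             size_t code_unit_offset)
{
    size_t characters = 0;
    assert(subject != NULL);
    for (size_t i = 0; i < code_unit_offset; i++)
        if ((subject[i] & 0xC0) != 0x80)
            characters++;
    return characters;
}
```

       Passing ovector[0] or ovector[1] from the 8-bit library as the second
       argument gives the corresponding character offset.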

       After a partial match  (error  return  PCRE2_ERROR_PARTIAL),  only  the
       first  pair  of  offsets  (that is, ovector[0] and ovector[1]) are set.
       They identify the part of the subject that was partially  matched.  See
       the pcre2partial documentation for details of partial matching.

       After a successful match, the first pair of offsets identifies the por-
       tion of the subject string that was matched by the entire pattern.  The
       next  pair  is  used for the first capturing subpattern, and so on. The
       value returned by pcre2_match() is one more than the  highest  numbered
       pair  that  has been set. For example, if two substrings have been cap-
       tured, the returned value is 3. If there are no capturing  subpatterns,
       the return value from a successful match is 1, indicating that just the
       first pair of offsets has been set.

       If a pattern uses the \K escape sequence within a  positive  assertion,
       the reported start of a successful match can be greater than the end of
       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
       "ab", the start and end offset values for the match are 2 and 0.

       If a capturing subpattern is matched repeatedly within a single match
       operation, it is the last portion of the subject that it matched that
       is returned.

       If the ovector is too small to hold all the captured substring offsets,
       as much as possible is filled in, and the function returns a  value  of
       zero.  If captured substrings are not of interest, pcre2_match() may be
       called with a match data block whose ovector is of minimum length (that
       is, one pair). However, if the pattern contains back references and the
       ovector is not big enough to remember the related substrings, PCRE2 has
       to  get  additional  memory for use during matching. Thus it is usually
       advisable to set up a match data block containing an ovector of reason-
       able size.

       It  is  possible for capturing subpattern number n+1 to match some part
       of the subject when subpattern n has not been used at all. For example,
       if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
       return from the function is 4, and subpatterns 1 and 3 are matched, but
       2  is  not.  When  this happens, both values in the offset pairs corre-
       sponding to unused subpatterns are set to PCRE2_UNSET.

       Offset values that correspond to unused subpatterns at the end  of  the
       expression  are  also  set  to  PCRE2_UNSET. For example, if the string
       "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
       are not matched. The return from the function is 2, because the highest
       used capturing subpattern number is 1. The offsets for the second and
       third capturing subpatterns (assuming the vector is large enough, of
       course) are set to PCRE2_UNSET.
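
       The relationship between unset pairs and the return value can be
       checked with a small sketch that works on a plain array of offsets.
       The names below are hypothetical: UNSET_OFFSET stands in for
       PCRE2_UNSET and size_t for PCRE2_SIZE, so that the fragment compiles
       without the PCRE2 header. It assumes an ovector large enough for all
       the groups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET, which pcre2.h defines as (~(PCRE2_SIZE)0). */
#define UNSET_OFFSET SIZE_MAX

/* Nonzero if the offset pair for `group` was set by the match. */
int group_is_set(const size_t *ovector, uint32_t group)
{
    return ovector[2 * group] != UNSET_OFFSET;
}

/* One more than the highest-numbered pair that is set. `pairs` is the
   number of pairs in the ovector; pair 0 is always set after a
   successful match, so the result is at least 1. */
int highest_set_pair_plus_one(const size_t *ovector, uint32_t pairs)
{
    assert(ovector != NULL && pairs >= 1);
    for (uint32_t n = pairs - 1; n > 0; n--)
        if (group_is_set(ovector, n))
            return (int)n + 1;
    return 1;
}
```

       For the offsets { 0,3, 0,1, unset,unset, 1,3 } produced by
       (a|(z))(bc) on "abc" the result is 4, and for { 0,3, 0,3, unset,unset,
       unset,unset } produced by (abc)(x(yz)?)? it is 2, in agreement with
       the return values described above.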

       Elements in the ovector that do not correspond to capturing parentheses
       in the pattern are never changed. That is, if a pattern contains n cap-
       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
       pcre2_match().  The  other  elements retain whatever values they previ-
       ously had.


OTHER INFORMATION ABOUT A MATCH

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);

       As well as the offsets in the ovector, other information about a  match
       is  retained  in the match data block and can be retrieved by the above
       functions in appropriate circumstances. If they  are  called  at  other
       times, the result is undefined.

       After  a  successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
       failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name  may  be  avail-
       able,  and  pcre2_get_mark() can be called. It returns a pointer to the
       zero-terminated name, which is within the compiled  pattern.  Otherwise
       NULL  is  returned.  After a successful match, the (*MARK) name that is
       returned is the last one encountered on the matching path  through  the
       pattern.  After  a  "no match" or a partial match, the last encountered
       (*MARK) name is returned. For example, consider this pattern:

         ^(*MARK:A)((*MARK:B)a|b)c

       When it matches "bc", the returned mark is A. The B mark is  "seen"  in
       the  first  branch of the group, but it is not on the matching path. On
       the other hand, when this pattern fails to  match  "bx",  the  returned
       mark is B.

       After  a  successful  match, a partial match, or one of the invalid UTF
       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar()  can
       be called. After a successful or partial match it returns the code unit
       offset of the character at which the match started. For  a  non-partial
       match,  this can be different to the value of ovector[0] if the pattern
       contains the \K escape sequence. After a partial match,  however,  this
       value  is  always the same as ovector[0] because \K does not affect the
       result of a partial match.

       After a UTF check failure, pcre2_get_startchar() can be used to  obtain
       the code unit offset of the invalid UTF character. Details are given in
       the pcre2unicode page.


ERROR RETURNS FROM pcre2_match()

       If pcre2_match() fails, it returns a negative number. This can be  con-
       verted  to a text string by calling pcre2_get_error_message(). Negative
       error codes are also returned by other functions,  and  are  documented
       with them.  The codes are given names in the header file. If UTF check-
       ing is in force and an invalid UTF subject string is detected, one of a
       number  of  UTF-specific  negative error codes is returned. Details are
       given in the pcre2unicode page. The following are the other errors that
       may be returned by pcre2_match():

         PCRE2_ERROR_NOMATCH

       The subject string did not match the pattern.

         PCRE2_ERROR_PARTIAL

       The  subject  string did not match, but it did match partially. See the
       pcre2partial documentation for details of partial matching.

         PCRE2_ERROR_BADMAGIC

       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
       to  catch  the case when it is passed a junk pointer. This is the error
       that is returned when the magic number is not present.

         PCRE2_ERROR_BADMODE

       This error is given when a pattern  that  was  compiled  by  the  8-bit
       library  is  passed  to  a  16-bit  or 32-bit library function, or vice
       versa.

         PCRE2_ERROR_BADOFFSET

       The value of startoffset was greater than the length of the subject.

         PCRE2_ERROR_BADOPTION

       An unrecognized bit was set in the options argument.

         PCRE2_ERROR_BADUTFOFFSET

       The UTF code unit sequence that was passed as a subject was checked and
       found  to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
       value of startoffset did not point to the beginning of a UTF  character
       or the end of the subject.

         PCRE2_ERROR_CALLOUT

       This  error  is never generated by pcre2_match() itself. It is provided
       for use by callout functions that want to cause pcre2_match() to return
       a  distinctive  error  code.  See  the  pcre2callout  documentation for
       details.

         PCRE2_ERROR_INTERNAL

       An unexpected internal error has occurred. This error could  be  caused
       by a bug in PCRE2 or by overwriting of the compiled pattern.

         PCRE2_ERROR_JIT_BADOPTION

       This  error  is  returned  when a pattern that was successfully studied
       using JIT is being matched, but the matching mode (partial or  complete
       match)  does  not  correspond to any JIT compilation mode. When the JIT
       fast path function is used, this error may also be  given  for  invalid
       options. See the pcre2jit documentation for more details.

         PCRE2_ERROR_JIT_STACKLIMIT

       This  error  is  returned  when a pattern that was successfully studied
       using JIT is being matched, but the memory available for  the  just-in-
       time  processing stack is not large enough. See the pcre2jit documenta-
       tion for more details.

         PCRE2_ERROR_MATCHLIMIT

       The backtracking limit was reached.

         PCRE2_ERROR_NOMEMORY

       If a pattern contains back references,  but  the  ovector  is  not  big
       enough  to  remember  the  referenced substrings, PCRE2 gets a block of
       memory at the start of matching to use for this purpose. There are some
       other  special cases where extra memory is needed during matching. This
       error is given when memory cannot be obtained.

         PCRE2_ERROR_NULL

       Either the code, subject, or match_data argument was passed as NULL.

         PCRE2_ERROR_RECURSELOOP

       This error is returned when  pcre2_match()  detects  a  recursion  loop
       within  the  pattern. Specifically, it means that either the whole pat-
       tern or a subpattern has been called recursively for the second time at
       the  same  position  in  the  subject string. Some simple patterns that
       might do this are detected and faulted at compile time, but  more  com-
       plicated  cases,  in particular mutual recursions between two different
       subpatterns, cannot be detected until matching is attempted.

         PCRE2_ERROR_RECURSIONLIMIT

       The internal recursion limit was reached.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       Captured substrings can be accessed directly by using  the  ovector  as
       described above.  For convenience, auxiliary functions are provided for
       extracting  captured  substrings  as  new,  separate,   zero-terminated
       strings. A substring that contains a binary zero is correctly extracted
       and has a further zero added on the end, but  the  result  is  not,  of
       course, a C string.

       The functions in this section identify substrings by number. The number
       zero refers to the entire matched substring, with higher numbers refer-
       ring  to  substrings  captured by parenthesized groups. After a partial
       match, only substring zero is available.  An  attempt  to  extract  any
       other  substring  gives the error PCRE2_ERROR_PARTIAL. The next section
       describes similar functions for extracting captured substrings by name.

       If a pattern uses the \K escape sequence within a  positive  assertion,
       the reported start of a successful match can be greater than the end of
       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
       "ab",  the  start  and  end offset values for the match are 2 and 0. In
       this situation, calling these functions with a  zero  substring  number
       extracts a zero-length empty string.

       You  can  find the length in code units of a captured substring without
       extracting it by calling pcre2_substring_length_bynumber().  The  first
       argument  is a pointer to the match data block, the second is the group
       number, and the third is a pointer to a variable into which the  length
       is  placed.  If  you just want to know whether or not the substring has
       been captured, you can pass the third argument as NULL.

       The pcre2_substring_copy_bynumber() function  copies  a  captured  sub-
       string  into  a supplied buffer, whereas pcre2_substring_get_bynumber()
       copies it into new memory, obtained using the  same  memory  allocation
       function  that  was  used for the match data block. The first two argu-
       ments of these functions are a pointer to the match data  block  and  a
       capturing group number.

       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
       the buffer and a pointer to a variable that contains its length in code
       units.  This is updated to contain the actual number of code units used
       for the extracted substring, excluding the terminating zero.

       For pcre2_substring_get_bynumber() the third and fourth arguments point
       to  variables that are updated with a pointer to the new memory and the
       number of code units that comprise the substring, again  excluding  the
       terminating  zero.  When  the substring is no longer needed, the memory
       should be freed by calling pcre2_substring_free().

       The return value from all these functions is zero  for  success,  or  a
       negative  error  code.  If  the pattern match failed, the match failure
       code is returned.  If a substring number  greater  than  zero  is  used
       after  a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
       error codes are:

         PCRE2_ERROR_NOMEMORY

       The buffer was too small for  pcre2_substring_copy_bynumber(),  or  the
       attempt to get memory failed for pcre2_substring_get_bynumber().

         PCRE2_ERROR_NOSUBSTRING

       There  is  no  substring  with that number in the pattern, that is, the
       number is greater than the number of capturing parentheses.

         PCRE2_ERROR_UNAVAILABLE

       The substring number, though not greater than the number of captures in
       the pattern, is greater than the number of slots in the ovector, so the
       substring could not be captured.

         PCRE2_ERROR_UNSET

       The substring did not participate in the match.  For  example,  if  the
       pattern  is  (abc)|(def) and the subject is "def", and the ovector con-
       tains at least two capturing slots, substring number 1 is unset.
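
       The checks described in this section can be summarized by the sketch
       below. It is not the library's implementation: copy_by_number(), the
       ERR_ constants (placeholder values, not those in pcre2.h), and
       UNSET_OFFSET (a stand-in for PCRE2_UNSET) are all hypothetical, and
       the subject and ovector are passed explicitly instead of through a
       match data block. A successful, non-partial match is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNSET_OFFSET SIZE_MAX   /* stand-in for PCRE2_UNSET */

/* Placeholder codes for this sketch only; real values are in pcre2.h. */
enum { ERR_NOMEMORY = -101, ERR_NOSUBSTRING = -102,
       ERR_UNAVAILABLE = -103, ERR_UNSET = -104 };

/* Copies group `number` into `buffer`, adding a terminating zero.
   *bufflen holds the buffer size on entry and the substring length
   (excluding the terminating zero) on successful return. */
int copy_by_number(const char *subject, const size_t *ovector,
                   uint32_t ovector_pairs, uint32_t capture_count,
                   uint32_t number, char *buffer, size_t *bufflen)
{
    assert(subject && ovector && buffer && bufflen);
    if (number > capture_count) return ERR_NOSUBSTRING;
    if (number >= ovector_pairs) return ERR_UNAVAILABLE;

    size_t start = ovector[2 * number];
    size_t end = ovector[2 * number + 1];
    if (start == UNSET_OFFSET) return ERR_UNSET;

    /* With \K inside an assertion the start can exceed the end; the
       extracted string is then empty. */
    size_t length = (start > end) ? 0 : end - start;
    if (length + 1 > *bufflen) return ERR_NOMEMORY;

    memcpy(buffer, subject + start, length);
    buffer[length] = '\0';
    *bufflen = length;
    return 0;
}
```

       With the pattern (abc)|(def), the subject "def", and the offsets
       { 0,3, unset,unset, 0,3 }, group 1 yields the unset code and group 2
       yields "def" with a length of 3.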


EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       The pcre2_substring_list_get() function  extracts  all  available  sub-
       strings  and  builds  a  list of pointers to them. It also (optionally)
       builds a second list that  contains  their  lengths  (in  code  units),
       excluding a terminating zero that is added to each of them. All this is
       done in a single block of memory that is obtained using the same memory
       allocation function that was used to get the match data block.

       This  function  must be called only after a successful match. If called
       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.

       The address of the memory block is returned via listptr, which is  also
       the start of the list of string pointers. The end of the list is marked
       by a NULL pointer. The address of the list of lengths is  returned  via
       lengthsptr.  If your strings do not contain binary zeros and you do not
       therefore need the lengths, you may supply NULL as the lengthsptr argu-
       ment  to  disable  the  creation of a list of lengths. The yield of the
       function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
       ory  block could not be obtained. When the list is no longer needed, it
       should be freed by calling pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can happen
       when  capturing subpattern number n+1 matches some part of the subject,
       but subpattern n has not been used at all, it returns an empty  string.
       This  can  be  distinguished  from  a  genuine zero-length substring by
       inspecting  the  appropriate  offset  in  the  ovector, which contains
       PCRE2_UNSET   for   unset   substrings,   or   by   calling  pcre2_sub-
       string_length_bynumber().


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find the  associated
       number. For example, for this pattern:

         (a+)b(?<xxx>\d+)...

       the number of the subpattern called "xxx" is 2. If the name is known to
       be unique (PCRE2_DUPNAMES was not set), you can find  the  number  from
       the name by calling pcre2_substring_number_from_name(). The first argu-
       ment is the compiled pattern, and the second is the name. The yield  of
       the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
       is no subpattern of  that  name,  or  PCRE2_ERROR_NOUNIQUESUBSTRING  if
       there  is  more than one subpattern of that name. Given the number, you
       can extract the  substring  directly,  or  use  one  of  the  functions
       described above.

       For  convenience,  there are also "byname" functions that correspond to
       the "bynumber" functions, the only difference  being  that  the  second
       argument  is  a  name instead of a number. If PCRE2_DUPNAMES is set and
       there are duplicate names, these functions scan all the groups with the
       given name, and return the first named string that is set.

       If  there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
       returned. If all groups with the name have  numbers  that  are  greater
       than  the  number  of  slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
       returned. If there is at least one group with a slot  in  the  ovector,
       but no group is found to be set, PCRE2_ERROR_UNSET is returned.

       Warning: If the pattern uses the (?| feature to set up multiple subpat-
       terns with the same number, as described in the  section  on  duplicate
       subpattern  numbers  in  the pcre2pattern page, you cannot use names to
       distinguish the different subpatterns, because names are  not  included
       in  the compiled code. The matching process uses only numbers. For this
       reason, the use of different names for subpatterns of the  same  number
       causes an error at compile time.


CREATING A NEW STRING WITH SUBSTITUTIONS

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);

       This  function calls pcre2_match() and then makes a copy of the subject
       string in outputbuffer, replacing the part that was  matched  with  the
       replacement  string,  whose  length is supplied in rlength. This can be
       given as PCRE2_ZERO_TERMINATED for a zero-terminated string.

       In the replacement string, which is interpreted as a UTF string in  UTF
       mode,  and  is  checked  for UTF validity unless the PCRE2_NO_UTF_CHECK
       option is set, a dollar character is an escape character that can spec-
       ify  the  insertion of characters from capturing groups in the pattern.
       The following forms are recognized:

         $$      insert a dollar character
         $<n>    insert the contents of group <n>
         ${<n>}  insert the contents of group <n>

       Either a group number or a group name  can  be  given  for  <n>.  Curly
       brackets  are  required only if the following character would be inter-
       preted as part of the number or name. The number may be zero to include
       the  entire  matched  string.   For  example,  if  the pattern a(b)c is
       matched with "=abc=" and the replacement string "+$1$0$1+", the  result
       is "=+babcb+=". Group insertion is done by calling
       pcre2_substring_copy_byname() or pcre2_substring_copy_bynumber() as
       appropriate.
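
       The replacement syntax can be illustrated by a simplified expander
       that works on an existing set of offsets. This is a sketch under
       stated assumptions, not the code of pcre2_substitute():
       substitute_once() and put() are hypothetical names, only numbered
       groups are handled (names are rejected), a single replacement is made,
       and every failure is reported as -1.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Appends n bytes to out, always leaving room for a terminating zero. */
static int put(char *out, size_t cap, size_t *used, const char *s, size_t n)
{
    if (*used + n + 1 > cap) return -1;
    memcpy(out + *used, s, n);
    *used += n;
    return 0;
}

/* Hypothetical helper: expands $$, $<n> and ${<n>} (digits only) for one
   match described by `ovector`, writing the new string to `out`. */
int substitute_once(const char *subject, const size_t *ovector, size_t pairs,
                    const char *replacement, char *out, size_t cap)
{
    size_t used = 0, subject_len = strlen(subject);
    assert(pairs >= 1 && ovector[0] <= ovector[1] &&
           ovector[1] <= subject_len);

    /* The text before the match is copied unchanged. */
    if (put(out, cap, &used, subject, ovector[0]) != 0) return -1;

    for (const char *p = replacement; *p != '\0'; ) {
        if (*p != '$' || *++p == '$') {      /* literal, or $$ */
            if (put(out, cap, &used, p++, 1) != 0) return -1;
            continue;
        }
        int braced = (*p == '{');
        if (braced) p++;
        if (!isdigit((unsigned char)*p)) return -1;
        size_t group = 0;
        while (isdigit((unsigned char)*p)) {
            group = group * 10 + (size_t)(*p++ - '0');
            if (group >= pairs) return -1;   /* no such group */
        }
        if (braced && *p++ != '}') return -1;
        size_t start = ovector[2 * group], end = ovector[2 * group + 1];
        if (start > end || end > subject_len) return -1;  /* unset */
        if (put(out, cap, &used, subject + start, end - start) != 0)
            return -1;
    }

    /* The text after the match is copied unchanged. */
    if (put(out, cap, &used, subject + ovector[1],
            subject_len - ovector[1]) != 0) return -1;
    out[used] = '\0';
    return 0;
}
```

       Given the offsets { 1,4, 2,3 } for a(b)c matched against "=abc=", the
       replacement "+$1$0$1+" produces "=+babcb+=", as in the example above.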

       The first seven arguments of pcre2_substitute() are  the  same  as  for
       pcre2_match(), except that the partial matching options are not permit-
       ted, and match_data may be passed as NULL, in which case a  match  data
       block  is obtained and freed within this function, using memory manage-
       ment functions from the match context, if provided, or else those  that
       were used to allocate memory for the compiled code.

       There  is  one additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes
       the function to iterate over the subject string, replacing every match-
       ing substring. If this is not set, only the first matching substring is
       replaced.

       The outlengthptr argument must point to a variable  that  contains  the
       length,  in  code units, of the output buffer. It is updated to contain
       the length of the new string, excluding the trailing zero that is auto-
       matically added.

       The  function  returns  the number of replacements that were made. This
       may be zero if no matches were found,  and  is  never  greater  than  1
       unless PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a neg-
       ative error code is returned. Except for PCRE2_ERROR_NOMATCH (which  is
       never returned), any errors from pcre2_match() or the substring copying
       functions  are  passed  straight  back.  PCRE2_ERROR_BADREPLACEMENT  is
       returned  for an invalid replacement string (unrecognized sequence fol-
       lowing a dollar sign), and PCRE2_ERROR_NOMEMORY is returned if the out-
       put buffer is not big enough.


DUPLICATE SUBPATTERN NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When  a  pattern  is compiled with the PCRE2_DUPNAMES option, names for
       subpatterns are not required to be unique. Duplicate names  are  always
       allowed  for subpatterns with the same number, created by using the (?|
       feature. Indeed, if such subpatterns are named, they  are  required  to
       use the same names.

       Normally, patterns with duplicate names are such that in any one match,
       only one of the named subpatterns participates. An example is shown  in
       the pcre2pattern documentation.

       When   duplicates   are   present,   pcre2_substring_copy_byname()  and
       pcre2_substring_get_byname() return the first  substring  corresponding
       to   the   given   name   that   is  set.  Only  if  none  are  set  is
       PCRE2_ERROR_UNSET is returned.  The  pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
       duplicate names.

       If you want to get full details of all captured substrings for a  given
       name,  you  must use the pcre2_substring_nametable_scan() function. The
       first argument is the compiled pattern, and the second is the name.  If
       the  third  and fourth arguments are NULL, the function returns a group
       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.

       When the third and fourth arguments are not NULL, they must be pointers
       to  variables  that are updated by the function. After it has run, they
       point to the first and last entries in the name-to-number table for the
       given  name,  and the function returns the length of each entry in code
       units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there  are
       no entries for the given name.

       The format of the name table is described above in the section entitled
       Information about a pattern. Given all the relevant  entries  for  the
       name,  you  can extract each of their numbers, and hence the captured
       data.
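
       As a minimal sketch of that last step, the code below walks the entries
       for one name in the 8-bit library. It assumes the documented layout for
       each entry (two bytes of group number, most  significant  byte  first,
       followed by the zero-terminated name). The helper name and the
       hand-built table are hypothetical; they stand in for the pointers and
       entry length that pcre2_substring_nametable_scan() would supply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: collect the group numbers held in the name-table
   entries from 'first' to 'last' inclusive. 'entry_size' is the length of
   each entry in code units (bytes in the 8-bit library). Returns the number
   of entries in the range; at most 'max' numbers are stored in 'out'. */
static size_t collect_group_numbers(const unsigned char *first,
  const unsigned char *last, size_t entry_size, int *out, size_t max)
{
  assert(entry_size > 2 && first <= last);
  size_t n = (size_t)(last - first) / entry_size + 1;
  for (size_t i = 0; i < n && i < max; i++)
    {
    const unsigned char *e = first + i * entry_size;
    out[i] = (e[0] << 8) | e[1];   /* most significant byte first */
    }
  return n;
}

/* Illustrative data only (an assumption, not library output): three
   entries of five bytes each, as if the name "dn" had been given to
   groups 1, 3 and 260. */
static const unsigned char demo_table[] = {
  0x00, 0x01, 'd', 'n', 0,
  0x00, 0x03, 'd', 'n', 0,
  0x01, 0x04, 'd', 'n', 0
};
```

       With real data, each number obtained this way can be passed to one  of
       the by-number extraction functions, for example
       pcre2_substring_get_bynumber().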


FINDING ALL POSSIBLE MATCHES AT ONE POSITION

       The traditional matching function uses a  similar  algorithm  to  Perl,
       which  stops when it finds the first match at a given point in the sub-
       ject. If you want to find all possible matches, or the longest possible
       match  at  a  given  position,  consider using the alternative matching
       function (see below) instead. If you cannot use the  alternative  func-
       tion, you can kludge it up by making use of the callout facility, which
       is described in the pcre2callout documentation.

       What you have to do is to insert a callout right at the end of the pat-
       tern.   When your callout function is called, extract and save the cur-
       rent matched substring. Then return 1, which  forces  pcre2_match()  to
       backtrack  and  try other alternatives. Ultimately, when it runs out of
       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.


MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       The function pcre2_dfa_match() is called  to  match  a  subject  string
       against  a  compiled pattern, using a matching algorithm that scans the
       subject string just once, and does not backtrack.  This  has  different
       characteristics  to  the  normal  algorithm, and is not compatible with
       Perl. Some of the features of PCRE2 patterns are not supported.  Never-
       theless,  there are times when this kind of matching can be useful. For
       a discussion of the two matching algorithms, and  a  list  of  features
       that pcre2_dfa_match() does not support, see the pcre2matching documen-
       tation.

       The arguments for the pcre2_dfa_match() function are the  same  as  for
       pcre2_match(), plus two extras. The ovector within the match data block
       is used in a different way, and this is described below. The other com-
       mon  arguments  are used in the same way as for pcre2_match(), so their
       description is not repeated here.

       The two additional arguments provide workspace for  the  function.  The
       workspace  vector  should  contain at least 20 elements. It is used for
       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
       workspace  is needed for patterns and subjects where there are a lot of
       potential matches.

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */
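
       Because wscount is a number of elements and not a size in bytes, it can
       be safer to derive it from the array than to repeat the literal. The
       following is only a sketch; the macro name WS_ELEMENTS is an assumption
       and is not part of PCRE2.

```c
#include <stddef.h>

/* Hypothetical macro: number of elements in a true array. It must not be
   applied to a pointer, where it would give a meaningless result. */
#define WS_ELEMENTS(a) (sizeof(a) / sizeof((a)[0]))

static int wspace[20];

/* A value suitable for the wscount argument when 'wspace' is passed as
   the workspace vector. */
static size_t demo_wscount(void)
{
  return WS_ELEMENTS(wspace);
}
```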

   Option bits for pcre2_dfa_match()

       The unused bits of the options argument for pcre2_dfa_match()  must  be
       zero.  The  only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
       PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK,       PCRE2_PARTIAL_HARD,       PCRE2_PARTIAL_SOFT,
       PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but  the  last  four  of
       these  are  exactly the same as for pcre2_match(), so their description
       is not repeated here.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These have the same general effect as they do  for  pcre2_match(),  but
       the  details are slightly different. When PCRE2_PARTIAL_HARD is set for
       pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if  the  end  of  the
       subject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches  have  already  been found. When PCRE2_PARTIAL_SOFT is set, the
       return code PCRE2_ERROR_NOMATCH is converted  into  PCRE2_ERROR_PARTIAL
       if  the  end  of  the  subject  is reached, there have been no complete
       matches, but there is still at least one matching possibility. The por-
       tion  of  the  string that was inspected when the longest partial match
       was found is set as the first matching string in both cases. There is a
       more  detailed  discussion  of partial and multi-segment matching, with
       examples, in the pcre2partial documentation.
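
       The difference between the two options can be summarized as a small
       decision function. This is a model of the behaviour described above and
       not library code: the names are illustrative stand-ins for the PCRE2_*
       macros, and the case where the ovector overflows is ignored.

```c
#include <assert.h>

/* Illustrative stand-ins (assumption: real code would use the macros from
   pcre2.h rather than these local names). */
enum { MODE_NONE, MODE_PARTIAL_SOFT, MODE_PARTIAL_HARD };
enum { RC_NOMATCH = -1, RC_PARTIAL = -2 };

/* 'complete' is the number of complete matches found. 'pending_at_end' is
   nonzero if the end of the subject was reached while at least one
   matching possibility still needed more characters. */
static int dfa_partial_result(int mode, int complete, int pending_at_end)
{
  assert(complete >= 0);
  /* Hard: a pending possibility wins even over complete matches. */
  if (mode == MODE_PARTIAL_HARD && pending_at_end) return RC_PARTIAL;
  if (complete > 0) return complete;
  /* Soft: only a would-be "no match" is converted. */
  if (mode == MODE_PARTIAL_SOFT && pending_at_end) return RC_PARTIAL;
  return RC_NOMATCH;
}
```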

         PCRE2_DFA_SHORTEST

       Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm  to
       stop as soon as it has found one match. Because of the way the alterna-
       tive algorithm works, this is necessarily the shortest  possible  match
       at the first possible matching point in the subject string.

         PCRE2_DFA_RESTART

       When  pcre2_dfa_match() returns a partial match, it is possible to call
       it again, with additional subject characters, and have it continue with
       the same match. The PCRE2_DFA_RESTART option requests this action; when
       it  is  set, the workspace and wscount arguments must refer to the same
       vector as before, because data about the match so far is  left  in  the
       workspace  after  a  partial  match.  There  is more discussion of this
       facility in the pcre2partial documentation.

   Successful returns from pcre2_dfa_match()

       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
       string in the subject. Note, however, that all the matches from one run
       of  the  function  start  at the same point in the subject. The shorter
       matches are all initial substrings of the longer matches. For  example,
       if the pattern

         <.*>

       is matched against the string

         This is <something> <something else> <something further> no more

       the three matched strings are

         <something> <something else> <something further>
         <something> <something else>
         <something>

       On  success,  the  yield of the function is a number greater than zero,
       which is the number of matched substrings.  The  offsets  of  the  sub-
       strings  are returned in the ovector, and can be extracted by number in
       the same way as for pcre2_match(), but the numbers bear no relation  to
       any  capturing groups that may exist in the pattern, because DFA match-
       ing does not support group capture.
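
       As a sketch of how the yield and the ovector fit together, the function
       below computes the length of match number n from pairs of start and end
       offsets. The array holds offsets worked out by hand for the <.*>
       example above; it is an assumption standing in for the vector that
       pcre2_get_ovector_pointer() would return, and size_t is used in place
       of PCRE2_SIZE so that the fragment is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Length of match number 'n', given an ovector laid out as pairs of
   (start, end) offsets. 'rc' is the positive yield of the match call. */
static size_t dfa_match_length(const size_t *ovector, int rc, int n)
{
  assert(rc > 0 && n >= 0 && n < rc);
  return ovector[2*n + 1] - ovector[2*n];
}

/* Hand-computed offsets for the example subject: every match starts at
   offset 8 (the first '<'), and the longest match comes first. */
static const size_t demo_ovector[] = { 8, 56,  8, 36,  8, 19 };
```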

       Calls to the convenience functions  that  extract  substrings  by  name
       return  the  error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
       after a DFA match. The convenience functions that extract substrings by
       number  never  return PCRE2_ERROR_NOSUBSTRING, and the meanings of some
       other errors are slightly different:

         PCRE2_ERROR_UNAVAILABLE

       The ovector is not big enough to include a slot for the given substring
       number.

         PCRE2_ERROR_UNSET

       There  is  a  slot  in  the  ovector for this substring, but there were
       insufficient matches to fill it.

       The matched strings are stored in  the  ovector  in  reverse  order  of
       length;  that  is,  the longest matching string is first. If there were
       too many matches to fit into the ovector, the yield of the function  is
       zero, and the vector is filled with the longest matches.
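
       The rules above for a given substring number can be modelled as
       follows. This is a sketch of the documented behaviour, not the
       library's code, and the names are illustrative.

```c
#include <assert.h>

enum { SLOT_OK, SLOT_UNSET, SLOT_UNAVAILABLE };  /* illustrative names */

/* 'rc' is the yield of a successful DFA match (zero means that there were
   too many matches and every pair in the ovector holds one). 'pairs' is
   the number of offset pairs in the ovector. 'n' is the substring number
   being requested. */
static int dfa_slot_status(int rc, int pairs, int n)
{
  assert(rc >= 0 && pairs > 0 && n >= 0);
  if (n >= pairs) return SLOT_UNAVAILABLE;   /* no slot for this number */
  int filled = (rc == 0) ? pairs : rc;
  return (n < filled) ? SLOT_OK : SLOT_UNSET;
}
```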

       NOTE:  PCRE2's  "auto-possessification" optimization usually applies to
       character repeats at the end of a pattern (as well as internally).  For
       example,  the pattern "a\d+" is compiled as if it were "a\d++". For DFA
       matching, this means that only one possible  match  is  found.  If  you
       really  do  want multiple matches in such cases, either use an ungreedy
       repeat such as "a\d+?" or set  the  PCRE2_NO_AUTO_POSSESS  option  when
       compiling.

   Error returns from pcre2_dfa_match()

       The pcre2_dfa_match() function returns a negative number when it fails.
       Many of the errors are the same  as  for  pcre2_match(),  as  described
       above.  There are in addition the following errors that are specific to
       pcre2_dfa_match():

         PCRE2_ERROR_DFA_UITEM

       This return is given if pcre2_dfa_match() encounters  an  item  in  the
       pattern that it does not support, for instance, the use of \C or a back
       reference.

         PCRE2_ERROR_DFA_UCOND

       This return is given if pcre2_dfa_match() encounters a  condition  item
       that  uses  a back reference for the condition, or a test for recursion
       in a specific group. These are not supported.

         PCRE2_ERROR_DFA_WSSIZE

       This return is given if pcre2_dfa_match() runs  out  of  space  in  the
       workspace vector.

         PCRE2_ERROR_DFA_RECURSE

       When  a  recursive subpattern is processed, the matching function calls
       itself recursively, using private memory for the ovector and workspace.
       This  error  is given if the internal ovector is not large enough. This
       should be extremely rare, as a vector of size 1000 is used.

         PCRE2_ERROR_DFA_BADRESTART

       When pcre2_dfa_match() is called  with  the  PCRE2_DFA_RESTART  option,
       some  plausibility  checks  are  made on the contents of the workspace,
       which should contain data about the previous partial match. If  any  of
       these checks fail, this error is given.


SEE ALSO

       pcre2build(3),    pcre2callout(3),    pcre2demo(3),   pcre2matching(3),
       pcre2partial(3),    pcre2posix(3),    pcre2sample(3),    pcre2stack(3),
       pcre2unicode(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 January 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

BUILDING PCRE2

       PCRE2  is distributed with a configure script that can be used to build
       the library in Unix-like environments using the applications  known  as
       Autotools. Also in the distribution are files to support building using
       CMake instead of configure.  The  text  file  README  contains  general
       information  about  building  with Autotools (some of which is repeated
       below), and also has some comments about building on various  operating
       systems.  There  is a lot more information about building PCRE2 without
       using Autotools (including information about using CMake  and  building
       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
       consult this file as well as the README file if you are building  in  a
       non-Unix-like environment.


PCRE2 BUILD-TIME OPTIONS

       The rest of this document describes the optional features of PCRE2 that
       can be selected when the library is compiled. It  assumes  use  of  the
       configure  script,  where  the  optional features are selected or dese-
       lected by providing options to configure before running the  make  com-
       mand.  However,  the same options can be selected in both Unix-like and
       non-Unix-like environments if you are using CMake instead of  configure
       to build PCRE2.

       If  you  are not using Autotools or CMake, option selection can be done
       by editing the config.h file, or by passing parameter settings  to  the
       compiler, as described in NON-AUTOTOOLS-BUILD.

       The complete list of options for configure (which includes the standard
       ones such as the  selection  of  the  installation  directory)  can  be
       obtained by running

         ./configure --help

       The  following  sections  include  descriptions  of options whose names
       begin with --enable or --disable. These settings specify changes to the
       defaults  for  the configure command. Because of the way that configure
       works, --enable and --disable always come in pairs, so  the  complemen-
       tary  option always exists as well, but as it specifies the default, it
       is not described.


BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       By default, a library called libpcre2-8 is built, containing  functions
       that  take  string arguments contained in vectors of bytes, interpreted
       either as single-byte characters, or UTF-8 strings. You can also  build
       two  other libraries, called libpcre2-16 and libpcre2-32, which process
       strings that are contained in vectors of 16-bit and 32-bit code  units,
       respectively. These can be interpreted either as single-unit characters
       or UTF-16/UTF-32 strings. To build these additional libraries, add  one
       or both of the following to the configure command:

         --enable-pcre2-16
         --enable-pcre2-32

       If you do not want the 8-bit library, add

         --disable-pcre2-8

       as  well.  At least one of the three libraries must be built. Note that
       the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
       an  8-bit  program.  Neither  of these is built if you select only the
       16-bit or 32-bit libraries.


BUILDING SHARED AND STATIC LIBRARIES

       The Autotools PCRE2 building process uses libtool to build both  shared
       and  static  libraries by default. You can suppress an unwanted library
       by adding one of

         --disable-shared
         --disable-static

       to the configure command.


UNICODE AND UTF SUPPORT

       By default, PCRE2 is built with support for Unicode and  UTF  character
       strings.  To build it without Unicode support, add

         --disable-unicode

       to  the configure command. This setting applies to all three libraries.
       It is not possible to build  one  library  with  Unicode  support,  and
       another without, in the same configuration.

       Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8,
       UTF-16 or UTF-32. To do that, applications that use the library have to
       set  the  PCRE2_UTF  option when they call pcre2_compile() to compile a
       pattern.

       UTF support allows the libraries to process character code points up to
       0x10ffff  in the strings that they handle. It also provides support for
       accessing the Unicode properties  of  such  characters,  using  pattern
       escapes  such  as  \P, \p, and \X. Only the general category properties
       such as Lu and Nd are supported. Details are given in the  pcre2pattern
       documentation.


JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiler support is included in the build by specifying

         --enable-jit

       This  support  is available only for certain hardware architectures. If
       this option is set for  an  unsupported  architecture,  a  build  error
       occurs.  See  the pcre2jit documentation for a discussion of JIT usage.
       When JIT support is enabled, pcre2grep automatically makes use  of  it,
       unless you add

         --disable-pcre2grep-jit

       to the configure command.


NEWLINE RECOGNITION

       By  default, PCRE2 interprets the linefeed (LF) character as indicating
       the end of a line. This is the normal newline  character  on  Unix-like
       systems.  You can compile PCRE2 to use carriage return (CR) instead, by
       adding

         --enable-newline-is-cr

       to the configure  command.  There  is  also  an  --enable-newline-is-lf
       option, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two-character sequence CRLF (CR immediately followed by LF). If you
       want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
       CRLF as indicating a line ending. Finally, a fifth option, specified by

         --enable-newline-is-any

       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
       newline sequences are the three just mentioned, plus the single charac-
       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
       U+2029).

       Whatever default line ending convention is selected when PCRE2 is built
       can  be  overridden by applications that use the library. At build time
       it is conventional to use the standard for your operating system.
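
       The five conventions can be compared with a small classifier. The
       sketch below works on an array of code points so that no UTF decoding
       is needed; the enum and function names are hypothetical and are not
       PCRE2 identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names for the five conventions. */
enum nl_conv { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in code points of the newline sequence that starts at
   s[i], or 0 if there is none, under the given convention. 's' holds
   'len' code points. */
static size_t newline_length(const uint32_t *s, size_t len, size_t i,
  enum nl_conv conv)
{
  assert(i < len);
  uint32_t c = s[i];
  int crlf = (c == 0x0d && i + 1 < len && s[i + 1] == 0x0a);
  switch (conv)
    {
    case NL_CR:      return c == 0x0d;
    case NL_LF:      return c == 0x0a;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : (c == 0x0d || c == 0x0a);
    case NL_ANY:
      if (crlf) return 2;
      /* LF, VT, FF, CR, NEL, LS, PS */
      return c == 0x0a || c == 0x0b || c == 0x0c || c == 0x0d ||
             c == 0x85 || c == 0x2028 || c == 0x2029;
    }
  return 0;
}
```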


WHAT \R MATCHES

       By default, the sequence \R in a pattern matches  any  Unicode  newline
       sequence,  independently  of  what has been selected as the line ending
       sequence. If you specify

         --enable-bsr-anycrlf

       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
       ever  is selected when PCRE2 is built can be overridden by applications
       that use the library.


HANDLING VERY LARGE PATTERNS

       Within a compiled pattern, offset values are used  to  point  from  one
       part  to another (for example, from an opening parenthesis to an alter-
       nation metacharacter). By default, in the 8-bit and  16-bit  libraries,
       two-byte  values  are used for these offsets, leading to a maximum size
       for a compiled pattern of around 64K code units. This is sufficient  to
       handle all but the most gigantic patterns. Nevertheless, some people do
       want to process truly enormous patterns, so it is possible  to  compile
       PCRE2  to  use three-byte or four-byte offsets by adding a setting such
       as

         --with-link-size=3

       to the configure command. The value given must be 2, 3, or 4.  For  the
       16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
       using longer offsets slows down the operation of PCRE2 because  it  has
       to  load additional data when handling them. For the 32-bit library the
       value is always 4 and cannot be overridden; the value  of  --with-link-
       size is ignored.
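
       The rounding rules just described can be written down as a short
       function. This is a sketch of the stated rules only; the function name
       is an assumption and does not correspond to anything in the PCRE2
       build system.

```c
#include <assert.h>

/* Effective link size in bytes for a library with the given code unit
   width (8, 16 or 32). 'requested' is the value given to
   --with-link-size, which must be 2, 3 or 4. */
static int effective_link_size(int width, int requested)
{
  assert(width == 8 || width == 16 || width == 32);
  assert(requested >= 2 && requested <= 4);
  if (width == 32) return 4;                     /* setting is ignored */
  if (width == 16 && requested == 3) return 4;   /* rounded up */
  return requested;
}
```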


AVOIDING EXCESSIVE STACK USAGE

       When  matching  with the pcre2_match() function, PCRE2 implements back-
       tracking by making recursive  calls  to  an  internal  function  called
       match().  In  environments where the size of the stack is limited, this
       can severely limit PCRE2's operation. (The Unix  environment  does  not
       usually  suffer from this problem, but it may sometimes be necessary to
       increase  the  maximum  stack  size.  There  is  a  discussion  in  the
       pcre2stack  documentation.)  An  alternative approach to recursion that
       uses memory from the heap to remember data, instead of using  recursive
       function  calls, has been implemented to work round the problem of lim-
       ited stack size. If you want to build a version  of  PCRE2  that  works
       this way, add

         --disable-stack-for-recursion

       to the configure command. By default, the system functions malloc() and
       free() are called to manage the heap memory that is required, but  cus-
       tom  memory  management  functions  can  be  called instead. PCRE2 runs
       noticeably more slowly when built in this way. This option affects only
       the pcre2_match() function; it is not relevant for pcre2_dfa_match().


LIMITING PCRE2 RESOURCE USAGE

       Internally, PCRE2 has a function called match(), which it calls repeat-
       edly  (sometimes  recursively)  when  matching  a  pattern   with   the
       pcre2_match() function. By controlling the maximum number of times this
       function may be called during a single matching operation, a limit  can
       be  placed on the resources used by a single call to pcre2_match(). The
       limit can be changed at run time, as described in the pcre2api documen-
       tation.  The default is 10 million, but this can be changed by adding a
       setting such as

         --with-match-limit=500000

       to  the  configure  command.  This  setting  has  no  effect   on   the
       pcre2_dfa_match() matching function.

       In  some  environments  it is desirable to limit the depth of recursive
       calls of match() more strictly than the total number of calls, in order
       to  restrict  the maximum amount of stack (or heap, if --disable-stack-
       for-recursion is specified) that is used. A second limit controls this;
       it  defaults  to  the  value  that is set for --with-match-limit, which
       imposes no additional constraints. However, you can set a  lower  limit
       by adding, for example,

         --with-match-limit-recursion=10000

       to  the  configure  command.  This  value can also be overridden at run
       time.
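
       To illustrate why the two limits are separate, here is a toy
       backtracking matcher that counts both the total number of calls and
       the nesting depth. It is not PCRE2's match() function and shares no
       code with it; the pattern syntax (literals plus "*" for any run of
       characters), the structure, and the error values are all assumptions
       made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_ERR_MATCHLIMIT = -1, TOY_ERR_RECURSIONLIMIT = -2 };

struct toy_limits {
  unsigned long calls;            /* calls made so far */
  unsigned long match_limit;      /* cap on the total number of calls */
  unsigned long recursion_limit;  /* cap on the nesting depth */
};

/* Returns 1 for a match, 0 for no match, or a negative value when one of
   the limits is exceeded. */
static int toy_match(const char *p, const char *s, struct toy_limits *lim,
  unsigned long depth)
{
  assert(p != NULL && s != NULL && lim != NULL);
  if (++lim->calls > lim->match_limit) return TOY_ERR_MATCHLIMIT;
  if (depth > lim->recursion_limit) return TOY_ERR_RECURSIONLIMIT;
  if (*p == '\0') return *s == '\0';
  if (*p == '*')
    {
    /* Try the rest of the pattern at each possible position in turn. */
    for (;; s++)
      {
      int rc = toy_match(p + 1, s, lim, depth + 1);
      if (rc != 0 || *s == '\0') return rc;
      }
    }
  if (*s != '\0' && *p == *s) return toy_match(p + 1, s + 1, lim, depth + 1);
  return 0;
}
```

       In this model a failed branch still adds to the call count after it
       returns, whereas the depth falls back as soon as the branch is
       abandoned, so the depth limit bounds memory while the call limit
       bounds time.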


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE2 uses fixed tables for processing characters whose code points are
       less than 256. By default, PCRE2 is built with a set of tables that are
       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
       for ASCII codes only. If you add

         --enable-rebuild-chartables

       to  the  configure  command, the distributed tables are no longer used.
       Instead, a program called dftables is compiled and run. This outputs the
       source for a new set of tables, created in the default locale of your
       C run-time system. (This method of replacing the tables does  not  work
       if  you are cross compiling, because dftables is run on the local host.
       If you need to create alternative tables when cross compiling, you will
       have to do so "by hand".)


USING EBCDIC CODE

       PCRE2  assumes  by default that it will run in an environment where the
       character code is ASCII or Unicode, which is a superset of ASCII.  This
       is the case for most computer operating systems. PCRE2 can, however, be
       compiled to run in an 8-bit EBCDIC environment by adding

         --enable-ebcdic --disable-unicode

       to the configure command. This setting implies --enable-rebuild-charta-
       bles.  You  should  only  use  it if you know that you are in an EBCDIC
       environment (for example, an IBM mainframe operating system).

       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
       version  of  the  library. Consequently, --enable-unicode and --enable-
       ebcdic are mutually exclusive.

       The EBCDIC character that corresponds to an ASCII LF is assumed to have
       the  value  0x15 by default. However, in some EBCDIC environments, 0x25
       is used. In such an environment you should use

         --enable-ebcdic-nl25

       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
       acter (which, in Unicode, is 0x85).
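
       The resulting assignments can be tabulated in code. The structure and
       function below are hypothetical and simply restate the values given
       above.

```c
#include <assert.h>

struct ebcdic_newlines { unsigned char cr, lf, nel; };

/* 'nl25' is nonzero when --enable-ebcdic-nl25 was selected. */
static struct ebcdic_newlines ebcdic_newline_codes(int nl25)
{
  struct ebcdic_newlines v;
  v.cr  = 0x0d;                  /* same value as in ASCII */
  v.lf  = nl25 ? 0x25 : 0x15;
  v.nel = nl25 ? 0x15 : 0x25;    /* whichever value LF did not take */
  assert(v.lf != v.nel);
  return v;
}
```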

       The options that select newline behaviour, such as --enable-newline-is-
       cr, and equivalent run-time options, refer to these character values in
       an EBCDIC environment.


PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT

       By  default,  pcre2grep reads all files as plain text. You can build it
       so that it recognizes files whose names end in .gz or .bz2,  and  reads
       them with libz or libbz2, respectively, by adding one or both of

         --enable-pcre2grep-libz
         --enable-pcre2grep-libbz2

       to the configure command. These options naturally require that the rel-
       evant libraries are installed on your system. Configuration  will  fail
       if they are not.


PCRE2GREP BUFFER SIZE

       pcre2grep  uses an internal buffer to hold a "window" on the file it is
       scanning, in order to be able to output "before" and "after" lines when
       it  finds  a match. The size of the buffer is controlled by a parameter
       whose default value is 20K. The buffer itself is three times this size,
       but because of the way it is used for holding "before" lines, the long-
       est line that is guaranteed to be processable is  the  parameter  size.
       You can change the default parameter value by adding, for example,

         --with-pcre2grep-bufsize=50K

       to  the  configure  command.  The caller of pcre2grep can override this
       value by using --buffer-size on the command line.


PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

       If you add one of

         --enable-pcre2test-libreadline
         --enable-pcre2test-libedit

       to the configure command, pcre2test is linked with the  libreadline  or
       libedit  library, respectively, and when its input is from a terminal,
       it reads it using the readline() function. This  provides  line-editing
       and  history  facilities.  Note that libreadline is GPL-licensed, so if
       you distribute a binary of pcre2test linked in this way, there  may  be
       licensing issues. These can be avoided by linking instead with libedit,
       which has a BSD licence.

       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
       be  added to the pcre2test build. In many operating environments with a
       system-installed readline library this is sufficient. However, in some
       environments (e.g. if an unmodified distribution version of readline is
       in use), some extra configuration may be necessary.  The  INSTALL  file
       for libreadline says this:

         "Readline uses the termcap functions, but does not link with
         the termcap or curses library itself, allowing applications
         which link with readline the to choose an appropriate library."

       If  your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


DEBUGGING WITH VALGRIND SUPPORT

       If you add

         --enable-valgrind

       to the configure command, PCRE2 will use valgrind annotations  to  mark
       certain  memory  regions  as  unaddressable.  This  allows it to detect
       invalid memory accesses, and  is  mostly  useful  for  debugging  PCRE2
       itself.


CODE COVERAGE REPORTING

       If  your  C  compiler is gcc, you can build a version of PCRE2 that can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above. Then specify

         --enable-coverage

       to the configure command and build PCRE2 in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage reporting. If you have configured ccache to run  automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE2, so that ccache is not used.

       When --enable-coverage is used, the  following  additional  targets  are
       added to the Makefile:

         make coverage

       This creates a fresh coverage report for the PCRE2 test  suite.  It  is
       equivalent  to running "make coverage-reset", "make coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This removes the generated coverage report without cleaning the  cover-
       age data itself.

         make coverage-clean-data

       This  removes  the captured coverage data without removing the coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This cleans all coverage data including the generated coverage  report.
       For  more  information about code coverage, see the gcov and lcov docu-
       mentation.


SEE ALSO

       pcre2api(3), pcre2-config(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);


DESCRIPTION

       PCRE2  provides  a feature called "callout", which is a means of tempo-
       rarily passing control to the caller of PCRE2 in the middle of  pattern
       matching.  The caller of PCRE2 provides an external function by putting
       its  entry  point  in  a  match context (see pcre2_set_callout() in the
       pcre2api documentation).

       Within  a  regular  expression,  (?C) indicates the points at which the
       external function is to be called.  Different  callout  points  can  be
       identified  by  putting  a number less than 256 after the letter C. The
       default value is zero.  For  example,  this  pattern  has  two  callout
       points:

         (?C1)abc(?C2)def

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before  each
       item  in  the  pattern. For example, if PCRE2_AUTO_CALLOUT is used with
       the pattern

         A(\d{2}|--)

       it is processed as if it were

       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after  each  parenthesis  and
       alternation bar. If the pattern contains a conditional group whose con-
       dition is an assertion, an automatic callout  is  inserted  immediately
       before  the  condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)

       This applies only to assertion conditions (because they are  themselves
       independent groups).

       Automatic  callouts  can  be  used for tracking the progress of pattern
       matching.  The pcre2test program has a pattern  qualifier  (/auto_call-
       out)  that  sets  automatic callouts; when it is used, the output indi-
       cates how the pattern is being matched. This is useful information when
       you are trying to optimize the performance of a particular pattern.


MISSING CALLOUTS

       You  should  be  aware  that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc]  is
       compiled  as if it were a++[bc]. The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This  indicates that when matching [bc] fails, there is no backtracking
       into a+ and therefore the callouts that would be taken  for  the  back-
       tracks  do  not  occur.  You can disable the auto-possessify feature by
       passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the  pat-
       tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.
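
       The difference between the two outputs can be reproduced with a toy
       model. The sketch below is not PCRE2 code: count_class_tests() is a
       hypothetical function that hand-matches a+[bc] at the start of the
       subject and counts how often the [bc] item is tested, which corresponds
       to the number of "+2" lines in the outputs above.

```c
#include <stddef.h>

/* Toy model of matching a+[bc] anchored at the start of subject.
   Returns the number of times the [bc] item is tested. A non-zero
   possessive argument models a++[bc], where there is no backtracking
   into the repeat. *matched is set to 1 on a match, 0 otherwise. */
size_t count_class_tests(const char *subject, int possessive, int *matched)
{
    size_t run = 0;      /* length of the leading run of 'a' */
    size_t tests = 0;

    *matched = 0;
    while (subject[run] == 'a') run++;
    if (run == 0) return 0;            /* a+ itself fails; [bc] never tried */

    /* Greedy: try the longest run first, then give back one 'a' at a time
       unless the repeat is possessive. */
    for (size_t n = run; n >= 1; n--) {
        char next = subject[n];
        tests++;
        if (next == 'b' || next == 'c') {
            *matched = 1;
            break;
        }
        if (possessive) break;
    }
    return tests;
}
```

       With the subject "aaaa" the model tests [bc] once in the possessive
       case and four times otherwise.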

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item  in  a  pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored.  If  PCRE2_DOTALL
       is  not set, a match can start only after an internal newline or at the
       beginning of the subject,  and  pcre2_compile()  remembers  this.  This
       optimization  is  disabled,  however, if .* is in an atomic group or if
       there is a back reference to the capturing group in which  it  appears.
       It  is  also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
       ever, the presence of callouts does not affect it.

       For example, if the pattern .*\d is  compiled  with  PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This  shows  that all match attempts start at the beginning of the sub-
       ject. In other words, the pattern is anchored.  You  can  disable  this
       optimization  by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
       put changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This  shows more match attempts, starting at the second subject charac-
       ter.  Another optimization, described in the next section,  means  that
       there is no subsequent attempt to match with an empty subject.

       If  a  pattern  has more than one top-level branch, automatic anchoring
       occurs if all branches are anchorable.

   Other optimizations

       Other optimizations that provide fast "no match"  results  also  affect
       callouts.  For example, if the pattern is

         ab(?C4)cd

       PCRE2  knows  that  any matching string must contain the letter "d". If
       the subject string is "abyz", the  lack  of  "d"  means  that  matching
       doesn't  ever  start,  and  the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       PCRE2 also knows the minimum length of  a  matching  string,  and  will
       immediately  give  a "no match" return without actually running a match
       if the subject is not long enough, or, for unanchored patterns,  if  it
       has been scanned far enough.

       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
       MIZE option  to  pcre2_compile(),  or  by  starting  the  pattern  with
       (*NO_START_OPT).  This slows down the matching process, but does ensure
       that callouts such as the example above are obeyed.
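
       The effect of these checks on the example can be sketched as follows.
       This is a simplified model, not the code PCRE2 uses: the function name
       match_would_run() and its parameters are assumptions made for the
       illustration, and the real checks are internal to the library.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of two start-of-match checks: the subject must be at least
   min_length characters long and must contain the required character.
   Returns 1 if a match attempt would be run at all (so that callouts can
   be reached), or 0 for an immediate "no match". */
int match_would_run(const char *subject, size_t min_length, char required)
{
    if (strlen(subject) < min_length) return 0;
    if (strchr(subject, required) == NULL) return 0;
    return 1;
}
```

       For ab(?C4)cd the minimum length is 4 and the required character is
       "d", so the model rejects "abyz" without running a match but lets
       "abyd" through to the matcher.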


THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout  point,  if  an  external
       function  is  set  in  the match context, it is called. This applies to
       both normal and DFA matching. The first argument to the  callout  func-
       tion  is a pointer to a pcre2_callout block. The second argument is the
       void * callout data that was supplied when the callout was  set  up  by
       calling pcre2_set_callout() (see the pcre2api documentation). The call-
       out block structure contains the following fields:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;

       The version field contains the version number of the block format.  The
       current version is 0. The version number will change in future if addi-
       tional fields are added, but the intention is never to  remove  any  of
       the existing fields.

       The  callout_number  field  contains the number of the callout, as com-
       piled into the pattern (that is, the number after ?C for  manual  call-
       outs, and 255 for automatically generated callouts).

       The offset_vector field is a pointer to the vector of capturing offsets
       (the "ovector") that was passed to the matching function in  the  match
       data  block.  When pcre2_match() is used, the contents can be inspected
       in order to extract substrings that have been matched so  far,  in  the
       same  way as for extracting substrings after a match has completed. For
       the DFA matching function, this field is not useful.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The  start_match  field normally contains the offset within the subject
       at which the current match attempt  started.  However,  if  the  escape
       sequence  \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not  anchored,  the  callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within  the  subject  of
       the current match pointer.

       When pcre2_match() is used, the capture_top field contains one more
       than the number of the highest numbered captured substring so  far.  If
       no substrings have been captured, the value of capture_top is one. This
       is always the case when the DFA functions are used, because they do not
       support captured substrings.
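
       A callout function might use offset_vector and capture_top together as
       in the following sketch. The helper capture_so_far() and the macro
       DEMO_UNSET are names invented here; DEMO_UNSET stands in for the
       PCRE2_UNSET value from pcre2.h so that the sketch compiles on its own,
       and size_t stands in for PCRE2_SIZE. Rejecting group 0 is an assumption
       of the sketch, made because the overall match is not complete while a
       callout is running.

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET from pcre2.h, which marks an unset offset. */
#define DEMO_UNSET (~(size_t)0)

/* Given the offset_vector and capture_top values from a callout block,
   report the subject span captured so far by group n. Offsets are stored
   in pairs: ovector[2*n] is the start and ovector[2*n+1] is the end.
   Returns 1 and fills *start and *length if group n is set, else 0. */
int capture_so_far(const size_t *ovector, unsigned capture_top, unsigned n,
                   size_t *start, size_t *length)
{
    if (n == 0 || n >= capture_top) return 0;   /* nothing captured yet */
    if (ovector[2 * n] == DEMO_UNSET || ovector[2 * n + 1] == DEMO_UNSET)
        return 0;                               /* group exists but is unset */
    *start = ovector[2 * n];
    *length = ovector[2 * n + 1] - ovector[2 * n];
    return 1;
}
```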

       The  capture_last  field  contains the number of the most recently cap-
       tured substring. However, when a recursion exits, the value reverts  to
       what  it  was  outside  the recursion, as do the values of all captured
       substrings. If no substrings have been  captured,  the  value  of  cap-
       ture_last is 0. This is always the case for the DFA matching functions.

       The  pattern_position  field contains the offset to the next item to be
       matched in the pattern string.

       The next_item_length field contains the length of the next item  to  be
       matched in the pattern string. When the callout immediately precedes an
       alternation bar, a closing parenthesis, or the end of the pattern,  the
       length  is  zero. When the callout precedes an opening parenthesis, the
       length is that of the entire subpattern.

       The pattern_position and next_item_length fields are intended  to  help
       in  distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts.
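
       For instance, a callout function that has access to the original
       pattern string could recover the text of the next item as sketched
       below. The helper next_item_text() is hypothetical, and the sketch
       assumes an 8-bit, zero-terminated pattern.

```c
#include <stddef.h>
#include <string.h>

/* Copy the text of the next pattern item, as described by the
   pattern_position and next_item_length fields of a callout block, into
   buf as a zero-terminated string. Out-of-range values are clamped to the
   pattern, and the result is truncated to fit buf.
   Returns the number of characters copied. */
size_t next_item_text(const char *pattern, size_t pattern_position,
                      size_t next_item_length, char *buf, size_t bufsize)
{
    size_t pattern_length = strlen(pattern);
    size_t n = next_item_length;

    if (bufsize == 0) return 0;
    if (pattern_position > pattern_length) pattern_position = pattern_length;
    if (n > pattern_length - pattern_position)
        n = pattern_length - pattern_position;
    if (n > bufsize - 1) n = bufsize - 1;

    memcpy(buf, pattern + pattern_position, n);
    buf[n] = '\0';
    return n;
}
```

       With the pattern A(\d{2}|--), a pattern_position of 2 and a
       next_item_length of 5 select the item \d{2}.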

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have  been  passed.
       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.


RETURN VALUES

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If  the  value  is  greater  than
       zero,  matching  fails  at  the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative  values  should  normally  be   chosen   from   the   set   of
       PCRE2_ERROR_xxx  values.  In  particular,  PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure. The error  number  PCRE2_ERROR_CALLOUT  is
       reserved  for  use by callout functions; it will never be used by PCRE2
       itself.
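
       The three kinds of return value might be used as in the sketch below,
       which limits both the number of callouts and how far into the subject
       a match may reach. So that it compiles without pcre2.h, it uses
       stand-ins: demo_callout_block mirrors two fields of the callout block,
       and DEMO_ERROR_CALLOUT is a placeholder for PCRE2_ERROR_CALLOUT. The
       demo_budget structure and the function name are also invented for the
       illustration.

```c
#include <stddef.h>

/* Stand-in for the callout block fields that this sketch reads. A real
   callout function receives a pointer to the library's callout block. */
typedef struct {
    unsigned callout_number;
    size_t   current_position;
} demo_callout_block;

/* Placeholder negative value; a real program would return
   PCRE2_ERROR_CALLOUT from pcre2.h. */
#define DEMO_ERROR_CALLOUT (-37)

/* Caller-owned state, passed through the void * callout data argument. */
typedef struct {
    unsigned long calls;       /* callouts seen so far */
    unsigned long max_calls;   /* abandon the match beyond this many */
    size_t        max_offset;  /* refuse paths that go past this offset */
} demo_budget;

int demo_callout(demo_callout_block *block, void *data)
{
    demo_budget *budget = data;

    if (++budget->calls > budget->max_calls)
        return DEMO_ERROR_CALLOUT;    /* < 0: abandon the whole match */
    if (block->current_position > budget->max_offset)
        return 1;                     /* > 0: fail here, try alternatives */
    return 0;                         /* 0: matching proceeds as normal */
}
```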


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 January 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

DIFFERENCES BETWEEN PCRE2 AND PERL

       This document describes the differences in the ways that PCRE2 and Perl
       handle regular expressions. The differences  described  here  are  with
       respect to Perl versions 5.10 and above.

       1.  PCRE2  has only a subset of Perl's Unicode support. Details of what
       it does have are given in the pcre2unicode page.

       2. PCRE2 allows repeat quantifiers only  on  parenthesized  assertions,
       but  they  do not mean what you might think. For example, (?!a){3} does
       not assert that the next three characters are not "a". It just  asserts
       that  the  next  character  is not "a" three times (in principle: PCRE2
       optimizes this to run the assertion  just  once).  Perl  allows  repeat
       quantifiers  on  other  assertions such as \b, but these do not seem to
       have any use.

       3. Capturing subpatterns that occur inside  negative  lookahead  asser-
       tions  are  counted,  but their entries in the offsets vector are never
       set. Perl sometimes (but not always) sets its numerical variables  from
       inside negative assertions.

       4.  The  following Perl escape sequences are not supported: \l, \u, \L,
       \U, and \N when followed by a character name or Unicode value.  (\N  on
       its own, matching a non-newline character, is supported.) In fact these
       are implemented by Perl's general string-handling and are not  part  of
       its  pattern matching engine. If any of these are encountered by PCRE2,
       an error is generated by default. However, if the PCRE2_ALT_BSUX option
       is set, \U and \u are interpreted as ECMAScript interprets them.

       5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
       is built with Unicode support. The properties that can be  tested  with
       \p and \P are limited to the general category properties such as Lu and
       Nd, script names such as Greek or Han, and the derived  properties  Any
       and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
       not; the Perl documentation says "Because Perl hides the need  for  the
       user  to  understand the internal representation of Unicode characters,
       there is no need to implement the  somewhat  messy  concept  of  surro-
       gates."

       6.  PCRE2 does support the \Q...\E escape for quoting substrings. Char-
       acters in between are treated as literals. This is  slightly  different
       from  Perl  in  that  $  and  @ are also handled as literals inside the
       quotes. In Perl, they cause variable interpolation (but of course PCRE2
       does not have variables).  Note the following examples:

           Pattern            PCRE2 matches      Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.
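
       A program that builds patterns from arbitrary text can rely on this
       behaviour to quote the text. The sketch below uses a hypothetical
       helper, quote_literal(), that wraps a string in \Q...\E. Because a \E
       inside the text would end the quoting early, each one is written as
       \E\\E\Q, which closes the quote, matches a backslash and an E, and
       reopens the quote.

```c
#include <stddef.h>
#include <string.h>

/* Append s to out at offset *used, keeping out zero-terminated.
   Returns 0 if there is not enough room. */
static int append_str(char *out, size_t outsize, size_t *used, const char *s)
{
    size_t n = strlen(s);
    if (*used + n + 1 > outsize) return 0;
    memcpy(out + *used, s, n + 1);
    *used += n;
    return 1;
}

/* Write \Q<literal>\E into out, splitting around any embedded \E.
   Returns the length written, or 0 if out is too small. */
size_t quote_literal(const char *literal, char *out, size_t outsize)
{
    size_t used = 0;
    const char *p = literal;

    if (!append_str(out, outsize, &used, "\\Q")) return 0;
    while (*p != '\0') {
        if (p[0] == '\\' && p[1] == 'E') {
            if (!append_str(out, outsize, &used, "\\E\\\\E\\Q")) return 0;
            p += 2;
        } else {
            char one[2] = { *p, '\0' };
            if (!append_str(out, outsize, &used, one)) return 0;
            p++;
        }
    }
    if (!append_str(out, outsize, &used, "\\E")) return 0;
    return used;
}
```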

       7.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and
       (??{code})  constructions. However, there is support for recursive pat-
       terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
       the  PCRE2  "callout"  feature allows an external function to be called
       during  pattern  matching.  See  the  pcre2callout  documentation   for
       details.

       8.  Subpatterns  that  are called as subroutines (whether or not recur-
       sively) are always treated as atomic groups  in  PCRE2.  This  is  like
       Python,  but  unlike Perl.  Captured values that are set outside a sub-
       routine call can be referenced from inside in PCRE2, but not in  Perl.
       There is a discussion that explains these differences in more detail in
       the section on recursion differences  from  Perl  in  the  pcre2pattern
       page.

       9.  If  any  of the backtracking control verbs are used in a subpattern
       that is called as a subroutine  (whether  or  not  recursively),  their
       effect  is  confined to that subpattern; it does not extend to the sur-
       rounding pattern. This is not always the case in Perl.  In  particular,
       if  (*THEN)  is  present in a group that is called as a subroutine, its
       action is limited to that group, even if the group does not contain any
       |  characters.  Note that such subpatterns are processed as anchored at
       the point where they are tested.

       10. If a pattern contains more than one backtracking control verb,  the
       first  one  that  is backtracked onto acts. For example, in the pattern
       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but  a  failure
       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
       it is the same as PCRE2, but there are examples where it differs.

       11. Most backtracking verbs in assertions have  their  normal  actions.
       They are not confined to the assertion.

       12.  There are some differences that are concerned with the settings of
       captured strings when part of  a  pattern  is  repeated.  For  example,
       matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
       unset, but in PCRE2 it is set to "b".

       13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
       pattern names is not as general as Perl's. This is a consequence of the
       fact that PCRE2 works internally just with numbers, using  an  external
       table  to translate between numbers and names. In particular, a pattern
       such as (?|(?<a>A)|(?<b>B)), where the two capturing  parentheses  have
       the  same  number  but different names, is not supported, and causes an
       error at compile time. If it were allowed, it would not be possible  to
       distinguish  which  parentheses matched, because both names map to cap-
       turing subpattern number 1. To avoid this confusing situation, an error
       is given at compile time.

       14.  Perl  recognizes  comments in some places that PCRE2 does not, for
       example, between the ( and ? at the start of a subpattern.  If  the  /x
       modifier  is  set, Perl allows white space between ( and ? (though cur-
       rent Perls warn that this is deprecated) but PCRE2 never does, even  if
       the PCRE2_EXTENDED option is set.

       15.  Perl,  when  in warning mode, gives warnings for character classes
       such as [A-\d] or [a-[:digit:]]. It then treats the hyphens  as  liter-
       als. PCRE2 has no warning features, so it gives an error in these cases
       because they are almost certainly user mistakes.

       16. In PCRE2, the upper/lower case character properties Lu and  Ll  are
       not  affected when case-independent matching is specified. For example,
       \p{Lu} always matches an upper case letter. I think Perl has changed in
       this  respect; in the release at the time of writing (5.16), \p{Lu} and
       \p{Ll} match all letters, regardless of case, when case independence is
       specified.

       17.  PCRE2  provides  some  extensions  to  the Perl regular expression
       facilities.  Perl 5.10 includes new features that are  not  in  earlier
       versions  of  Perl, some of which (such as named parentheses) have been
       in PCRE2 for some time. This list is with respect to Perl 5.10:

       (a) Although lookbehind assertions in PCRE2  must  match  fixed  length
       strings,  each alternative branch of a lookbehind assertion can match a
       different length of string. Perl requires them all  to  have  the  same
       length.

       (b)  If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
       $ meta-character matches only at the very end of the string.

       (c) A backslash followed  by  a  letter  with  no  special  meaning  is
       faulted. (Perl can be made to issue a warning.)

       (d)  If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
       fiers is inverted, that is, by default they are not greedy, but if fol-
       lowed by a question mark they are.

       (e)  PCRE2_ANCHORED  can be used at matching time to force a pattern to
       be tried only at the first matching position in the subject string.

       (f)      The      PCRE2_NOTBOL,      PCRE2_NOTEOL,      PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART,  and PCRE2_NO_AUTO_CAPTURE options have no Perl
       equivalents.

       (g) The \R escape sequence can be restricted to match only CR,  LF,  or
       CRLF by the PCRE2_BSR_ANYCRLF option.

       (h) The callout facility is PCRE2-specific.

       (i) The partial matching facility is PCRE2-specific.

       (j)  The alternative matching function (pcre2_dfa_match()) matches in a
       different way and is not Perl-compatible.

       (k) PCRE2 recognizes some special sequences such as (*CR) at the  start
       of a pattern that set overall options that cannot be changed within the
       pattern.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 28 September 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 JUST-IN-TIME COMPILER SUPPORT

       Just-in-time  compiling  is a heavyweight optimization that can greatly
       speed up pattern matching. However, it comes at the cost of extra  pro-
       cessing  before  the  match is performed, so it is of most benefit when
       the same pattern is going to be matched many times. This does not  nec-
       essarily  mean many calls of a matching function; if the pattern is not
       anchored, matching attempts may take place many times at various  posi-
       tions in the subject, even for a single call. Therefore, if the subject
       string is very long, it may still pay  to  use  JIT  even  for  one-off
       matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
       32-bit PCRE2 libraries.

       JIT support applies only to the  traditional  Perl-compatible  matching
       function.   It  does  not apply when the DFA matching function is being
       used. The code for this support was written by Zoltan Herczeg.


AVAILABILITY OF JIT SUPPORT

       JIT support is an optional feature of  PCRE2.  The  "configure"  option
       --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
       built if you want to use JIT. The support is limited to  the  following
       hardware platforms:

         ARM 32-bit (v5, v7, and Thumb2)
         ARM 64-bit
         Intel x86 32-bit and 64-bit
         MIPS 32-bit and 64-bit
         Power PC 32-bit and 64-bit
         SPARC 32-bit

       If --enable-jit is set on an unsupported platform, compilation fails.

       A  program  can  tell if JIT support is available by calling pcre2_con-
       fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
       available,  and 0 otherwise. However, a simple program does not need to
       check this in order to use JIT. The API is implemented in  a  way  that
       falls  back  to the interpretive code if JIT is not available. For pro-
       grams that need the best possible performance, there is  also  a  "fast
       path" API that is JIT-specific.


SIMPLE USE OF JIT

       To  make use of the JIT support in the simplest way, all you have to do
       is to call pcre2_jit_compile() after successfully compiling  a  pattern
       with pcre2_compile(). This function has two arguments: the first is the
       compiled pattern pointer that was returned by pcre2_compile(), and  the
       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.

       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
       pattern is passed to the JIT compiler, which turns it into machine code
       that executes much faster than the normal interpretive code, but yields
       exactly the same results. The returned value  from  pcre2_jit_compile()
       is zero on success, or a negative error code.

       PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
       plete matches. If you want to run partial matches using the  PCRE2_PAR-
       TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should
       set one or both of  the  other  options  as  well  as,  or  instead  of
       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
       for each of the three modes (normal, soft partial, hard partial).  When
       pcre2_match()  is  called,  the appropriate code is run if it is avail-
       able. Otherwise, the pattern is matched using interpretive code.

       You can call pcre2_jit_compile() multiple times for the  same  compiled
       pattern. It skips any option bit for which it has already compiled
       code. For example, you can call it once with PCRE2_JIT_COMPLETE and
       (perhaps later, when you find you need partial matching) again with
       PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
       PCRE2_JIT_COMPLETE and just compile code for partial matching. If
       pcre2_jit_compile() is called with no option bits set, it immediately
       returns zero. This is an alternative way of testing whether JIT is
       available.
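
       The way repeated calls skip modes that already have code can be
       modelled with a bit mask. The sketch below is only a model of the
       behaviour described above, not library code. The DEMO_ constants are
       placeholder values standing in for the PCRE2_JIT_ option bits, and
       jit_modes_to_compile() is a made-up name.

```c
/* Placeholder bit values for this sketch; a real program would use the
   PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_SOFT and PCRE2_JIT_PARTIAL_HARD
   constants from pcre2.h. */
#define DEMO_JIT_COMPLETE      0x1u
#define DEMO_JIT_PARTIAL_SOFT  0x2u
#define DEMO_JIT_PARTIAL_HARD  0x4u

/* Model of repeated JIT compile calls on one pattern: modes that already
   have JIT code are skipped. *compiled tracks the modes compiled so far;
   the return value is the set of modes newly compiled by this call. */
unsigned jit_modes_to_compile(unsigned *compiled, unsigned requested)
{
    unsigned fresh = requested & ~*compiled;
    *compiled |= requested;
    return fresh;
}
```

       In this model a first request for the complete mode compiles it, and a
       later request for the complete and hard partial modes compiles only
       the hard partial mode.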

       At present, it is not possible to free JIT compiled  code  except  when
       the entire compiled pattern is freed by calling pcre2_code_free().

       In  some circumstances you may need to call additional functions. These
       are described in the  section  entitled  "Controlling  the  JIT  stack"
       below.

       There are some pcre2_match() options that are not supported by JIT, and
       there are also some pattern items that JIT cannot handle.  Details  are
       given  below.  In  both cases, matching automatically falls back to the
       interpretive code. If you want to know whether JIT  was  actually  used
       for  a particular match, you should arrange for a JIT callback function
       to be set up as described in the section entitled "Controlling the  JIT
       stack"  below,  even  if  you  do  not need to supply a non-default JIT
       stack. Such a callback function is called whenever JIT code is about to
       be  obeyed.  If the match-time options are not right for JIT execution,
       the callback function is not obeyed.

       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
       ated.  You  can find out if JIT matching is available after compiling a
       pattern by calling  pcre2_pattern_info()  with  the  PCRE2_INFO_JITSIZE
       option.  A non-zero result means that JIT compilation was successful. A
       result of 0 means that JIT support is not available, or the pattern was
       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
       to handle the pattern.


UNSUPPORTED OPTIONS AND PATTERN ITEMS

       The pcre2_match() options that  are  supported  for  JIT  matching  are
       PCRE2_NOTBOL,   PCRE2_NOTEOL,  PCRE2_NOTEMPTY,  PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and  PCRE2_PARTIAL_SOFT.  The
       PCRE2_ANCHORED option is not supported at match time.

       The  only  unsupported  pattern items are \C (match a single data unit)
       when running in a UTF mode, and a callout immediately before an  asser-
       tion condition in a conditional group.


RETURN VALUES FROM JIT MATCHING

       When a pattern is matched using JIT matching, the return values are the
       same as those given by the interpretive pcre2_match()  code,  with  the
       addition  of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
       that the memory used for the JIT stack was insufficient. See  "Control-
       ling the JIT stack" below for a discussion of JIT stack usage.

       The  error  code  PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, as it  is  in
       the  same circumstance when JIT is not used, but the details of exactly
       what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT  error
       code is never returned when JIT matching is used.


CONTROLLING THE JIT STACK

       When the compiled JIT code runs, it needs a block of memory to use as a
       stack.  By default, it uses 32K on the  machine  stack.  However,  some
       large   or   complicated  patterns  need  more  than  this.  The  error
       PCRE2_ERROR_JIT_STACKLIMIT is given when there  is  not  enough  stack.
       Three  functions  are provided for managing blocks of memory for use as
       JIT stacks. There is further discussion about the use of JIT stacks  in
       the section entitled "JIT stack FAQ" below.

       The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu-
       ments are a starting size, a maximum size, and a general  context  (for
       memory  allocation  functions, or NULL for standard memory allocation).
       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
       NULL  if there is an error. The pcre2_jit_stack_free() function is used
       to free a stack that is no longer needed. (For the technically  minded:
       the address space is allocated by mmap or VirtualAlloc.)

       JIT  uses far less memory for recursion than the interpretive code, and
       a maximum stack size of 512K to 1M should be more than enough  for  any
       pattern.

       The  pcre2_jit_stack_assign()  function  specifies which stack JIT code
       should use. Its arguments are as follows:

         pcre2_match_context  *mcontext
         pcre2_jit_callback    callback
         void                 *data

       The first argument is a pointer to a match context. When this is subse-
       quently passed to a matching function, its information determines which
       JIT stack is used. There are three cases for the values  of  the  other
       two arguments:

         (1) If callback is NULL and data is NULL, an internal 32K block
             on the machine stack is used. This is the default when a match
             context is created.

         (2) If callback is NULL and data is not NULL, data must be
             a pointer to a valid JIT stack, the result of calling
             pcre2_jit_stack_create().

         (3) If callback is not NULL, it must point to a function that is
             called with data as an argument at the start of matching, in
             order to set up a JIT stack. If the return from the callback
             function is NULL, the internal 32K stack is used; otherwise the
             return value must be a valid JIT stack, the result of calling
             pcre2_jit_stack_create().
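
       The three cases can be summarized by the following model of the
       selection logic. It is a sketch written with stand-in types so that it
       compiles without pcre2.h: demo_jit_stack and demo_jit_callback mirror
       pcre2_jit_stack and pcre2_jit_callback, and resolve_jit_stack() is a
       hypothetical function, not part of the library.

```c
#include <stddef.h>

/* Opaque stand-in for pcre2_jit_stack. */
typedef struct demo_jit_stack demo_jit_stack;

/* Stand-in for the pcre2_jit_callback type: it takes the data pointer and
   returns a JIT stack, or NULL to request the internal 32K stack. */
typedef demo_jit_stack *(*demo_jit_callback)(void *);

/* Model of the three cases: returns the JIT stack that would be used, or
   NULL to mean the internal 32K block on the machine stack. */
demo_jit_stack *resolve_jit_stack(demo_jit_callback callback, void *data)
{
    if (callback != NULL)
        return callback(data);          /* case (3) */
    return (demo_jit_stack *)data;      /* cases (1) and (2) */
}

/* Example one-line callback: hand back whatever stack the caller stored
   in the data pointer. */
demo_jit_stack *return_stored_stack(void *data)
{
    return (demo_jit_stack *)data;
}
```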

       A  callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre2_match() is called with options that are incom-
       patible  for JIT matching. A callback function can therefore be used to
       determine whether a match operation was  executed  by  JIT  or  by  the
       interpreter.

       You may safely use the same JIT stack for more than one pattern (either
       by assigning directly or by callback), as long as the patterns are  all
       matched  sequentially in the same thread. In a multithread application,
       if you do not specify a JIT stack, or if you assign or pass  back  NULL
       from  a  callback, that is thread-safe, because each thread has its own
       machine stack. However, if you assign  or  pass  back  a  non-NULL  JIT
       stack,  this  must  be  a  different  stack for each thread so that the
       application is thread-safe.

       Strictly speaking, even more is allowed. You can assign the  same  non-
       NULL  stack  to a match context that is used by any number of patterns,
       as long as they are not used for matching by multiple  threads  at  the
       same  time.  For  example, you could use the same stack in all compiled
       patterns, with a global mutex in the callback to wait until  the  stack
       is available for use. However, this is an inefficient solution, and not
       recommended.

       This is a suggestion for how a multithreaded program that needs to  set
       up non-default JIT stacks might operate:

         During thread initialization
           thread_local_var = pcre2_jit_stack_create(...)

         During thread exit
           pcre2_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var

       All  the  functions  described in this section do nothing if JIT is not
       available.


JIT STACK FAQ

       (1) Why do we need JIT stacks?

       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is diffi-
       cult. For example, on PowerPC the stack chain needs to be updated every
       time the stack is extended. Although this is possible, the overhead of
       updating it decreases performance. So we do the recursion in memory.

       (2) Why don't we simply allocate blocks of memory with malloc()?

       Modern  operating  systems  have  a  nice  feature: they can reserve an
       address space instead of allocating memory. We can safely allocate mem-
       ory  pages  inside  this address space, so the stack could grow without
       moving memory data (this is important because of pointers). Thus we can
       allocate  1M  address space, and use only a single memory page (usually
       4K) if that is enough. However, we can still grow up to 1M  anytime  if
       needed.
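
       The reserve-then-commit idea can be illustrated on a POSIX system with
       mmap() and mprotect(). This is a sketch of the general technique only,
       not the code that the JIT stack allocator actually uses; the function
       names reserve_region(), commit_region() and release_region() are
       hypothetical.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Reserve `size` bytes of address space. No page is usable yet, so no
   memory is committed. Returns NULL on failure. */
void *reserve_region(size_t size)
{
    void *base = mmap(NULL, size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Make the first `used` bytes of the region readable and writable. The
   region does not move, so pointers into it stay valid as it grows.
   Returns 0 on success, -1 on failure. */
int commit_region(void *base, size_t used)
{
    return mprotect(base, used, PROT_READ | PROT_WRITE);
}

/* Give the whole reservation back to the operating system. */
void release_region(void *base, size_t size)
{
    munmap(base, size);
}
```

       A caller would reserve 1M once, commit a single page, and commit more
       pages only when the data no longer fits.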

       (3) Who "owns" a JIT stack?

       The owner of the stack is the user program, not the JIT studied pattern
       or anything else. While a stack is being used by pcre2_match() (that
       is, while it is assigned to a match context that has been passed to a
       match that is currently running), the user program must ensure that no
       other thread uses that stack; otherwise the threads would overwrite
       the same memory area. The best practice for multithreaded programs is
       to allocate a stack for each thread, and return this stack through the
       JIT callback function.

       (4) When should a JIT stack be freed?

       You can free a JIT stack at any time, as long as it will not be used by
       pcre2_match() again. When you assign the stack to a match context, only
       a pointer is set. There is no reference counting or any other magic.
       You can free compiled patterns, contexts, and stacks in any order, at
       any time. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that will cause a segmentation fault.
       (Also, do not free a stack that is currently being used by
       pcre2_match() in another thread.) You can also replace the stack in a
       context at any time when it is not in use. You should free the previous
       stack before assigning a replacement.

       (5)  Should  I  allocate/free  a  stack every time before/after calling
       pcre2_match()?

       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of patterns.

       (6)  OK, the stack is for long term memory allocation. But what happens
       if a pattern causes stack overflow with a stack of 1M? Is that 1M  kept
       until the stack is freed?

       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the amount of memory currently
       allocated for a stack, and another that releases memory (shrinking the
       stack), would probably be a good idea if someone needs this.

       (7) This is too much of a headache. Isn't there any better solution for
       JIT stack handling?

       No, thanks to Windows. If POSIX threads were used everywhere, we  could
       throw out this complicated API.


FREEING JIT SPECULATIVE MEMORY

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       The JIT executable allocator does not free all memory as soon as it is
       possible to do so. It expects new allocations, and keeps some free
       memory around to improve allocation speed. However, in low memory
       conditions, it might be better to free all possible memory. You can
       cause this to happen by calling pcre2_jit_free_unused_memory(). Its
       argument is a general context, for custom memory management, or NULL
       for standard memory management.


EXAMPLE CODE

       This  is  a  single-threaded example that specifies a JIT stack without
       using a callback. A real program should include  error  checking  after
       all the function calls.

         int rc;
         int errornumber;
         PCRE2_SIZE erroffset;
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         match_data = pcre2_match_data_create(10, NULL);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);


JIT FAST PATH API

       Because the API described above falls back to interpreted matching when
       JIT is not available, it is convenient for programs  that  are  written
       for  general  use  in  many  environments.  However,  calling  JIT  via
       pcre2_match() does have a performance impact. Programs that are written
       for  use  where  JIT  is known to be available, and which need the best
       possible performance, can instead use a "fast path"  API  to  call  JIT
       matching  directly instead of calling pcre2_match() (obviously only for
       patterns that have been successfully processed by pcre2_jit_compile()).

       The fast path  function  is  called  pcre2_jit_match(),  and  it  takes
       exactly the same arguments as pcre2_match(). The return values are also
       the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
       complete)  is  requested that was not compiled. Unsupported option bits
       (for example, PCRE2_ANCHORED) are ignored.

       When you call pcre2_match(), as well as testing for invalid options,  a
       number of other sanity checks are performed on the arguments. For exam-
       ple, if the subject pointer is NULL, an immediate error is given. Also,
       unless  PCRE2_NO_UTF_CHECK  is  set, a UTF subject string is tested for
       validity. In the interests of speed, these checks do not happen on  the
       JIT fast path, and if invalid data is passed, the result is undefined.

       Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
       speedups of more than 10%.


SEE ALSO

       pcre2api(3)


AUTHOR

       Philip Hazel (FAQ by Zoltan Herczeg)
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SIZE AND OTHER LIMITATIONS

       There are some size limitations in PCRE2 but it is hoped that they will
       never in practice be relevant.

       The maximum size of a compiled pattern is approximately 64K code  units
       for  the  8-bit  and  16-bit  libraries  if  PCRE2 is compiled with the
       default internal linkage size, which is 2 bytes for these libraries. If
       you  want  to  process regular expressions that are truly enormous, you
       can compile PCRE2 with an internal linkage size of 3 or 4 (when  build-
       ing  the  16-bit library, 3 is rounded up to 4). See the README file in
       the source distribution and the pcre2build documentation  for  details.
       In  these  cases the limit is substantially larger.  However, the speed
       of execution is slower. In the 32-bit  library,  the  internal  linkage
       size is always 4.

       The maximum length (in code units) of a subject string is one less than
       the largest number a PCRE2_SIZE variable can  hold.  PCRE2_SIZE  is  an
       unsigned  integer  type,  usually  defined as size_t. Its maximum value
       (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
       terminated strings and unset offsets.

       Note  that  when  using  the  traditional matching function, PCRE2 uses
       recursion to handle subpatterns and indefinite repetition.  This  means
       that  the  available stack space may limit the size of a subject string
       that can be processed by certain patterns. For a  discussion  of  stack
       issues, see the pcre2stack documentation.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns. There is,  however,  a
       limit  to  the  depth  of  nesting  of parenthesized subpatterns of all
       kinds. This is imposed in order to limit the  amount  of  system  stack
       used  at  compile time. The limit can be specified when PCRE2 is built;
       the default is 250.

       There is a limit to the number of forward references to subsequent sub-
       patterns  of  around  200,000.  Repeated  forward references with fixed
       upper limits, for example, (?2){0,100} when subpattern number 2  is  to
       the  right,  are included in the count. There is no limit to the number
       of backward references.

       The maximum length of a name for a named subpattern is 32 code units,
       and the maximum number of named subpatterns is 10000.

       The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or
       (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit  and
       32-bit libraries.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 25 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 MATCHING ALGORITHMS

       This document describes the two different algorithms that are available
       in PCRE2 for matching a compiled regular expression against a given
       subject string. The "standard" algorithm is the one provided by the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.

       An alternative algorithm is provided by the pcre2_dfa_match() function;
       it operates in a different way, and is not Perl-compatible. This alter-
       native  has  advantages  and  disadvantages  compared with the standard
       algorithm, and these are described below.

       When there is only one possible way in which a given subject string can
       match  a pattern, the two algorithms give the same answer. A difference
       arises, however, when there are multiple possibilities. For example, if
       the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one
       of them, whereas the alternative algorithm finds all three.


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be rep-
       resented  as  a  tree structure. An unlimited repetition in the pattern
       makes the tree of infinite size, but it is still a tree.  Matching  the
       pattern  to a given subject string (from a given starting point) can be
       thought of as a search of the tree.  There are two  ways  to  search  a
       tree:  depth-first  and  breadth-first, and these correspond to the two
       matching algorithms provided by PCRE2.


THE STANDARD MATCHING ALGORITHM

       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
       depth-first search of the pattern tree. That is, it  proceeds  along  a
       single path through the tree, checking that the subject matches what is
       required. When there is a mismatch, the algorithm  tries  any  alterna-
       tives  at  the  current point, and if they all fail, it backs up to the
       previous branch point in the  tree,  and  tries  the  next  alternative
       branch  at  that  level.  This often involves backing up (moving to the
       left) in the subject string as well.  The  order  in  which  repetition
       branches  are  tried  is controlled by the greedy or ungreedy nature of
       the quantifier.

       If a leaf node is reached, a matching string has  been  found,  and  at
       that  point the algorithm stops. Thus, if there is more than one possi-
       ble match, this algorithm returns the first one that it finds.  Whether
       this  is the shortest, the longest, or some intermediate length depends
       on the way the greedy and ungreedy repetition quantifiers are specified
       in the pattern.

       Because  it  ends  up  with a single path through the tree, it is rela-
       tively straightforward for this algorithm to keep  track  of  the  sub-
       strings  that  are  matched  by portions of the pattern in parentheses.
       This provides support for capturing parentheses and back references.


THE ALTERNATIVE MATCHING ALGORITHM

       This algorithm conducts a breadth-first search of  the  tree.  Starting
       from  the  first  matching  point  in the subject, it scans the subject
       string from left to right, once, character by character, and as it does
       this,  it remembers all the paths through the tree that represent valid
       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
       though  it is not implemented as a traditional finite state machine (it
       keeps multiple states active simultaneously).

       Although the general principle of this matching algorithm  is  that  it
       scans  the subject string only once, without backtracking, there is one
       exception: when a lookaround assertion is encountered,  the  characters
       following  or  preceding  the  current  point  have to be independently
       inspected.

       The scan continues until either the end of the subject is  reached,  or
       there  are  no more unterminated paths. At this point, terminated paths
       represent the different matching possibilities (if there are none,  the
       match  has  failed).   Thus,  if there is more than one possible match,
       this algorithm finds all of them, and in particular, it finds the long-
       est.  The  matches are returned in decreasing order of length. There is
       an option to stop the algorithm after the first match (which is  neces-
       sarily the shortest) is found.

       Note that all the matches that are found start at the same point in the
       subject. If the pattern

         cat(er(pillar)?)?

       is matched against the string "the caterpillar catchment",  the  result
       is  the  three  strings "caterpillar", "cater", and "cat" that start at
       the fifth character of the subject. The algorithm  does  not  automati-
       cally move on to find matches that start at later positions.

       PCRE2's "auto-possessification" optimization usually applies to charac-
       ter repeats at the end of a pattern (as well as internally). For  exam-
       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
       is no point even considering the possibility of backtracking  into  the
       repeated  digits.  For  DFA matching, this means that only one possible
       match is found. If you really do want multiple matches in  such  cases,
       either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
       SESS option when compiling.

       There are a number of features of PCRE2 regular  expressions  that  are
       not  supported  by the alternative matching algorithm. They are as fol-
       lows:

       1. Because the algorithm finds all  possible  matches,  the  greedy  or
       ungreedy  nature  of  repetition quantifiers is not relevant (though it
       may affect auto-possessification, as just described). During  matching,
       greedy  and  ungreedy  quantifiers are treated in exactly the same way.
       However, possessive quantifiers can make a difference when what follows
       could  also  match  what  is  quantified, for example in a pattern like
       this:

         ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would be matched  by
       a  non-possessive quantifier. Similarly, if an atomic group is present,
       it is matched as if it were a standalone pattern at the current  point,
       and  the  longest match is then "locked in" for the rest of the overall
       pattern.

       2. When dealing with multiple paths through the tree simultaneously, it
       is  not  straightforward  to  keep track of captured substrings for the
       different matching possibilities, and PCRE2's  implementation  of  this
       algorithm does not attempt to do this. This means that no captured sub-
       strings are available.

       3. Because no substrings are captured, back references within the  pat-
       tern are not supported, and cause errors if encountered.

       4.  For  the same reason, conditional expressions that use a backrefer-
       ence as the condition or test for a specific group  recursion  are  not
       supported.

       5.  Because  many  paths  through the tree may be active, the \K escape
       sequence, which resets the start of the match when encountered (but may
       be  on  some  paths  and not on others), is not supported. It causes an
       error if encountered.

       6. Callouts are supported, but the value of the  capture_top  field  is
       always 1, and the value of the capture_last field is always 0.

       7. The \C escape sequence, which (in the standard algorithm) always
       matches a single code unit, even in a UTF mode, is not supported in
       UTF modes, because the alternative algorithm moves through the subject
       string one character (not code unit) at a time, for all active paths
       through the tree.

       8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
       negative assertion.


ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using  the alternative matching algorithm provides the following advan-
       tages:

       1. All possible matches (at a single point in the subject) are automat-
       ically  found,  and  in particular, the longest match is found. To find
       more than one match using the standard algorithm, you have to do kludgy
       things with callouts.

       2.  Because  the  alternative  algorithm  scans the subject string just
       once, and never needs to backtrack (except for lookbehinds), it is pos-
       sible  to  pass  very  long subject strings to the matching function in
       several pieces, checking for partial matching each time. Although it is
       also  possible  to  do  multi-segment matching using the standard algo-
       rithm, by retaining partially matched substrings, it  is  more  compli-
       cated. The pcre2partial documentation gives details of partial matching
       and discusses multi-segment matching.


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than  the  standard  algorithm.  This  is
       partly  because  it has to search for all possible matches, but is also
       because it is less susceptible to optimization.

       2. Capturing parentheses and back references are not supported.

       3. Although atomic groups are supported, their use does not provide the
       performance advantage that it does for the standard algorithm.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 29 September 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PARTIAL MATCHING IN PCRE2

       In  normal  use  of  PCRE2,  if  the subject string that is passed to a
       matching function matches as far as it goes, but is too short to  match
       the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
       stances where it might be helpful to distinguish this case  from  other
       cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi-
       ate  feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered.  Partial  matching
       can  also be useful when the subject string is very long and is not all
       available at once.

       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT  and
       PCRE2_PARTIAL_HARD  options,  which  can be set when calling a matching
       function.  The difference between the two options is whether or  not  a
       partial match is preferred to an alternative complete match, though the
       details differ between the two types  of  matching  function.  If  both
       options are set, PCRE2_PARTIAL_HARD takes precedence.

       If  you  want to use partial matching with just-in-time optimized code,
       you must call pcre2_jit_compile() with one or both of these options:

         PCRE2_JIT_PARTIAL_SOFT
         PCRE2_JIT_PARTIAL_HARD

       PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
       tial  matches  on the same pattern. If the appropriate JIT mode has not
       been compiled, interpretive matching code is used.

       Setting a partial matching option  disables  two  of  PCRE2's  standard
       optimizations. PCRE2 remembers the last literal code unit in a pattern,
       and abandons matching immediately if it is not present in  the  subject
       string.  This  optimization  cannot  be  used for a subject string that
       might match only partially. PCRE2 also knows the minimum  length  of  a
       matching  string,  and  does not bother to run the matching function on
       shorter strings. This optimization is also disabled for partial  match-
       ing.


PARTIAL MATCHING USING pcre2_match()

       A  partial  match occurs during a call to pcre2_match() when the end of
       the subject string is reached successfully, but  matching  cannot  con-
       tinue because more characters are needed. However, at least one charac-
       ter in the subject must have been inspected. This  character  need  not
       form part of the final matched string; lookbehind assertions and the \K
       escape sequence provide ways of inspecting characters before the  start
       of  a matched string. The requirement for inspecting at least one char-
       acter exists because an empty string can  always  be  matched;  without
       such  a  restriction  there would always be a partial match of an empty
       string at the end of the subject.

       When a partial match is returned, the first two elements in the ovector
       point to the portion of the subject that was matched, but the values in
       the rest of the ovector are undefined. The appearance of \K in the pat-
       tern has no effect for a partial match. Consider this pattern:

         /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete match,
       and the ovector defines the matched string as "123", because \K  resets
       the  "start  of  match" point. However, if a partial match is requested
       and the subject string is "456abc12", a partial match is found for  the
       string  "abc12",  because  all these characters are needed for a subse-
       quent re-match with additional characters.

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If  PCRE2_PARTIAL_SOFT  is  set when pcre2_match() identifies a partial
       match, the partial match is remembered, but matching continues as  nor-
       mal,  and  other  alternatives in the pattern are tried. If no complete
       match  can  be  found,  PCRE2_ERROR_PARTIAL  is  returned  instead   of
       PCRE2_ERROR_NOMATCH.

       This  option  is "soft" because it prefers a complete match over a par-
       tial match.  All the various matching items in a pattern behave  as  if
       the  subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B  the  end
       of the subject is treated as a non-alphanumeric.

       If  there  is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both  alter-
       natives  fail  to  match,  but the end of the subject is reached during
       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
       and  9, identifying "123dog" as the first partial match that was found.
       (In this example, there are two partial matches, because "dog"  on  its
       own partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If  PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
       returned as soon as a partial match is  found,  without  continuing  to
       search  for possible complete matches. This option is "hard" because it
       prefers an earlier partial match over a later complete match. For  this
       reason,  the  assumption  is  made that the end of the supplied subject
       string may not be the true end of the available data, and  so,  if  \z,
       \Z,  \b, \B, or $ are encountered at the end of the subject, the result
       is PCRE2_ERROR_PARTIAL, provided that at least  one  character  in  the
       subject has been inspected.

   Comparing hard and soft partial matching

       The  difference  between the two partial matching options can be illus-
       trated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
       the  longer  string  if  possible). If it is matched against the string
       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
       TIAL. On the other hand, if the pattern is made ungreedy the result  is
       different:

         /dog(sbody)??/

       In  this  case  the  result  is always a complete match because that is
       found first, and matching never  continues  after  finding  a  complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will  always
       find the shorter match first.


PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for  all  possible  matches  simultane-
       ously.  If the end of the subject is reached before the end of the pat-
       tern, there is the possibility of a partial match, again provided  that
       at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise,  the  complete  matches
       are  returned.   However, if PCRE2_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of  the  string
       that was matched when the longest partial match was found is set as the
       first matching string.

       Because the DFA functions always search for all possible  matches,  and
       there  is  no  difference between greedy and ungreedy repetition, their
       behaviour is different from  the  standard  functions  when  PCRE2_PAR-
       TIAL_HARD  is  set.  Consider  the  string  "dog"  matched  against the
       ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas the standard function stops as soon as it  finds  the  complete
       match  for  "dog",  the  DFA  function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.


PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE2_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a  partial  match  is  found.
       However,  normal  matching carries on, and \b matches at the end of the
       subject when the last character is a letter, so  a  complete  match  is
       found.   The  result,  therefore,  is  not  PCRE2_ERROR_PARTIAL.  Using
       PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
       then the partial match takes precedence.


EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST

       If  the  partial_soft  (or  ps) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_SOFT option is used for the match.  Here  is  a
       run of pcre2test that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 25dec3\=ps
         Partial match: 25dec3
         data> 3ju\=ps
         Partial match: 3ju
         data> 3juj\=ps
         No match
         data> j\=ps
         No match

       The  first  data  string  is matched completely, so pcre2test shows the
       matched substrings. The remaining four strings do not  match  the  com-
       plete pattern, but the first two are partial matches. Similar output is
       obtained if DFA matching is used.

       If the partial_hard (or ph) modifier is present  on  a  pcre2test  data
       line, the PCRE2_PARTIAL_HARD option is set for the match.


MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When  a  partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject  data
       and  calling  the function again with the same compiled regular expres-
       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
       same working space as before, because this is where details of the pre-
       vious partial match are stored. Here is an example using pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The first call has "23ja" as the subject, and requests  partial  match-
       ing;  the  second  call  has  "n05"  as  the  subject for the continued
       (restarted) match.  Notice that when the match is  complete,  only  the
       last  part  is  shown;  PCRE2 does not retain the previously partially-
       matched string. It is up to the calling program to do that if it  needs
       to.

       That means that, for an unanchored pattern, if a continued match fails,
       it is not possible to try again at  a  new  starting  point.  All  this
       facility  is  capable  of  doing  is continuing with the previous match
       attempt. In the previous example, if the second set of data  is  "ug23"
       the  result is no match, even though there would be a match for "aug23"
       if the entire string were given at once. Depending on the  application,
       this may or may not be what you want.  The only way to allow for start-
       ing again at the next character is to retain the matched  part  of  the
       subject and try a new complete match.

       You  can  set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
       PCRE2_DFA_RESTART to continue partial matching over multiple  segments.
       This  facility can be used to pass very long subject strings to the DFA
       matching functions.


MULTI-SEGMENT MATCHING WITH pcre2_match()

       Unlike the DFA function, it is not possible  to  restart  the  previous
       match with a new segment of data when using pcre2_match(). Instead, new
       data must be added to the previous subject string, and the entire match
       re-run,  starting from the point where the partial match occurred. Ear-
       lier data can be discarded.

       It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
       not  treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
       dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\=ph
         Partial match: 23ja

       At  this stage, an application could discard the text preceding "23ja",
       add on text from the next  segment,  and  call  the  matching  function
       again.  Unlike  the  DFA  matching function, the entire matching string
       must always be available, and the complete matching process occurs  for
       each call, so more memory and more processing time are needed.
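
       The bookkeeping described above (discard the text that precedes the
       partial match, then add on the next segment) might be sketched as
       follows. This is only an illustration: retain_and_append() is a
       hypothetical helper, not a PCRE2 function, and it assumes 8-bit code
       units held in a fixed-capacity buffer owned by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. Drops buf[0 .. keep_from) from a
   buffer that currently holds `len` bytes, then appends the next segment.
   Returns the new length, or SIZE_MAX if the result would not fit in `cap`
   bytes (the buffer is then left unchanged). */
size_t retain_and_append(char *buf, size_t len, size_t cap, size_t keep_from,
                         const char *next, size_t next_len)
{
    assert(buf != NULL && keep_from <= len && len <= cap);
    size_t kept = len - keep_from;
    if (next_len > cap - kept)
        return SIZE_MAX;                   /* caller must grow the buffer */
    memmove(buf, buf + keep_from, kept);   /* source and target may overlap */
    memcpy(buf + kept, next, next_len);
    return kept + next_len;
}
```

       In the date example, keep_from would be the ovector[0] value from the
       partial match (12 for "The date is 23ja"), so that the retained text
       is "23ja". If the pattern contains a lookbehind, an earlier offset is
       needed, as described in the next section.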


ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to  pass  the  PCRE2_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE2_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.

       2. If a pattern contains a lookbehind assertion, characters  that  pre-
       cede  the start of the partial match may have been inspected during the
       matching process.  When using pcre2_match(), sufficient characters must
       be  retained  for  the  next  match attempt. You can ensure that enough
       characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind in
       the     pattern    by    calling    pcre2_pattern_info()    with    the
       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting  count  is  in
       characters, not code units. After a partial match, moving back from the
       ovector[0] offset in the subject by the number of characters given  for
       the  maximum lookbehind gets you to the earliest character that must be
       retained. In a non-UTF or a 32-bit situation, moving  back  is  just  a
       subtraction,  but in UTF-8 or UTF-16 you have to count characters while
       moving back through the code units.

       Characters before the point you have now reached can be discarded,  and
       after  the  next segment has been added to what is retained, you should
       run the next match with the startoffset argument set so that the  match
       begins at the same point as before.

       For  example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
       mum  lookbehind  count  is  3, so all characters before offset 2 can be
       discarded. The value of startoffset for the next  match  should  be  3.
       When  pcre2test  displays  a partial match, it indicates the lookbehind
       characters with '<' characters:

           re> "(?<=123)abc"
         data> xx123ab\=ph
         Partial match: 123ab
                        <<<

       3. Because a partial match must always contain at least one  character,
       what  might  be  considered a partial match of an empty string actually
       gives a "no match" result. For example:

           re> /c(?<=abc)x/
         data> ab\=ps
         No match

       If the next segment begins "cx", a match should be found, but this will
       only  happen  if characters from the previous segment are retained. For
       this reason, a "no match" result  should  be  interpreted  as  "partial
       match of an empty string" when the pattern contains lookbehinds.

       4.  Matching  a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one  single
       long  string,  especially  when PCRE2_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes  an  issue  that
       arises  if  the  pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because  (for
       PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
       no completed matches. This means that as soon as the shortest match has
       been  found,  continuation to a new subject segment is no longer possi-
       ble. Consider this pcre2test example:

           re> /dog(sbody)?/
         data> dogsb\=ps
          0: dog
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ps,dfa,dfa_restart
          0: g
         data> dogsbody\=dfa
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to  a  standard  matching
       function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
       a partial match for "dogsbody", the result is not  PCRE2_ERROR_PARTIAL,
       because  the  shorter string "dog" is a complete match. Similarly, when
       the subject is presented to a DFA matching function  in  several  parts
       ("do"  and  "gsb"  being  the first two) the match stops when "dog" has
       been found, and it is not possible to continue.  On the other hand,  if
       "dogsbody"  is  presented  as  a single string, a DFA matching function
       finds both matches.

       Because of these problems, it is best to  use  PCRE2_PARTIAL_HARD  when
       matching  multi-segment  data.  The  example above then behaves differ-
       ently:

           re> /dog(sbody)?/
         data> dogsb\=ph
         Partial match: dogsb
         data> do\=ph,dfa
         Partial match: do
         data> gsb\=ph,dfa,dfa_restart
         Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start  with  the  same  pattern  item  may  not  work  as expected when
       PCRE2_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial  match  of  the
       first  alternative  is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point  in  the  subject  string. Attempting to continue with the string
       "7890" does not yield a match  because  only  those  alternatives  that
       match  at  one  point in the subject are remembered. The problem arises
       because the start of the second alternative matches  within  the  first
       alternative.  There  is  no  problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives.  This  is
       not  a  problem  if  a  standard matching function is used, because the
       entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\=ph
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the  same  technique  of
       re-running  the  entire  match  can  also be used with the DFA matching
       function. Another possibility is to work with two buffers. If a partial
       match  at  offset  n in the first buffer is followed by "no match" when
       PCRE2_DFA_RESTART is used on the second buffer, you can then try a  new
       match starting at offset n+1 in the first buffer.
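
       As an illustration of point 2 above, stepping back from ovector[0] by
       the maximum lookbehind in a UTF-8 subject might look like the sketch
       below. utf8_retain_from() is a hypothetical helper, not part of PCRE2;
       it assumes that the subject is valid UTF-8 and that max_lookbehind is
       the value obtained for PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Starting at byte offset
   match_start (ovector[0] after a partial match) in a valid UTF-8 subject,
   step back over at most max_lookbehind characters and return the byte
   offset of the earliest character that must be retained. */
size_t utf8_retain_from(const unsigned char *subject, size_t match_start,
                        size_t max_lookbehind)
{
    assert(subject != NULL);
    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes (0b10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

       For "xx123ab" with a match start of 5 and a lookbehind of 3 this gives
       2. After the first two code units are discarded, the startoffset for
       the next match is 5 - 2 = 3, which agrees with the example in point 2.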


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

UNICODE AND UTF SUPPORT

       When PCRE2 is built with Unicode support (which is the default), it has
       knowledge of Unicode character properties and can process text  strings
       in  UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
       However, by default, PCRE2 assumes that one code unit is one character.
       To  process  a  pattern  as a UTF string, where a character may require
       more than one  code  unit,  you  must  call  pcre2_compile()  with  the
       PCRE2_UTF  option  flag,  or  the  pattern must start with the sequence
       (*UTF). When either of these is the case, both the pattern and any sub-
       ject  strings  that  are  matched against it are treated as UTF strings
       instead of strings of individual one-code-unit characters.

       If you do not need Unicode support you can build PCRE2 without  it,  in
       which case the library will be smaller.


UNICODE PROPERTY SUPPORT

       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. The Unicode properties that can  be  tested
       are  limited to the general category properties such as Lu for an upper
       case letter or Nd for a decimal number, the Unicode script  names  such
       as Arabic or Han, and the derived properties Any and L&. Full lists are
       given in the pcre2pattern and pcre2syntax documentation. Only the short
       names  for  properties are supported. For example, \p{L} matches a let-
       ter. Its Perl synonym, \p{Letter}, is not supported.   Furthermore,  in
       Perl, many properties may optionally be prefixed by "Is", for
       compatibility with Perl 5.6. PCRE2 does not support this.


WIDE CHARACTERS AND UTF MODES

       Codepoints less than 256 can be specified in patterns by either  braced
       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
       Larger values have to use braced sequences. Unbraced octal code  points
       up to \777 are also recognized; larger ones can be coded using \o{...}.

       In  UTF modes, repeat quantifiers apply to complete UTF characters, not
       to individual code units.

       In UTF modes, the dot metacharacter matches one UTF  character  instead
       of a single code unit.

       The  escape  sequence  \C can be used to match a single code unit, in a
       UTF mode, but its use can lead  to  some  strange  effects  because  it
       breaks  up  multi-unit  characters  (see  the  description of \C in the
       pcre2pattern documentation). The use of \C  is  not  supported  in  the
       alternative matching function pcre2_dfa_match(), nor is it supported in
       UTF mode by the JIT optimization. If JIT optimization is requested  for
       a  UTF pattern that contains \C, it will not succeed, and so the match-
       ing will be carried out by the normal interpretive function.

       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
       characters  of  any  code  value,  but, by default, the characters that
       PCRE2 recognizes as digits, spaces, or word characters remain the  same
       set  as  in  non-UTF  mode,  all  with  code points less than 256. This
       remains true even when PCRE2  is  built  to  include  Unicode  support,
       because  to do otherwise would slow down matching in many common cases.
       Note that this also applies to \b and \B, because they are  defined  in
       terms  of  \w  and  \W.  If you want to test for a wider sense of, say,
       "digit", you can use explicit Unicode property tests  such  as  \p{Nd}.
       Alternatively,  if you set the PCRE2_UCP option, the way that the char-
       acter escapes work is changed so that Unicode properties  are  used  to
       determine which characters match. There are more details in the section
       on generic character types in the pcre2pattern documentation.

       Similarly, characters that match the POSIX named character classes  are
       all low-valued characters, unless the PCRE2_UCP option is set.

       However,  the  special  horizontal  and  vertical  white space matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
       acters, whether or not PCRE2_UCP is set.

       Case-insensitive  matching in UTF mode makes use of Unicode properties.
       A few Unicode characters such as Greek sigma have more than  two  code-
       points that are case-equivalent, and these are treated as such.


VALIDITY OF UTF STRINGS

       When  the  PCRE2_UTF  option is set, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. If an invalid UTF string is passed, a negative error code
       is returned. The code unit offset to the  offending  character  can  be
       extracted  from  the match data block by calling pcre2_get_startchar(),
       which is used for this purpose after a UTF error.

       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not handle
       this, expecting strings to be in host byte order.
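
       Because the BOM is left to the caller, an application that reads
       UTF-16 text from an external source may want to normalize it first.
       The following is a minimal sketch under stated assumptions:
       utf16_to_host_order() is a hypothetical helper, not a PCRE2 function,
       and it assumes that a leading 0xFFFE unit is always a byte-swapped BOM
       (U+FFFE is a non-character, so it is unlikely to start real text).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Inspects the first code unit of
   a UTF-16 buffer. A BOM in host order (0xFEFF) needs no swapping; a
   byte-swapped BOM (0xFFFE) means every unit is swapped in place. Returns
   the number of leading units (0 or 1) that the caller should skip before
   passing the text to a matching function. */
size_t utf16_to_host_order(uint16_t *units, size_t count)
{
    assert(units != NULL || count == 0);
    if (count == 0)
        return 0;
    if (units[0] == 0xFFFEu) {
        for (size_t i = 0; i < count; i++)
            units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
    }
    return units[0] == 0xFEFFu ? 1 : 0;
}
```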

       The entire string is checked before any other processing  takes  place.
       In  addition  to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the  surrogate area.  The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.

       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with  values
       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
       other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)
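
       The arithmetic behind surrogate pairs can be sketched as follows. The
       function names are made up for this illustration and are not part of
       PCRE2; the constants are those of the UTF-16 encoding form.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: split a code point in the range 0x10000..0x10FFFF
   into the UTF-16 surrogate pair that encodes it. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    uint32_t v = cp - 0x10000u;                  /* 20 significant bits */
    *high = (uint16_t)(0xD800u + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (v & 0x3FFu));  /* bottom 10 bits */
}

/* The inverse: combine a valid high/low pair into one code point. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800u && high <= 0xDBFFu);
    assert(low >= 0xDC00u && low <= 0xDFFFu);
    return 0x10000u + (((uint32_t)(high - 0xD800u) << 10) |
                       (uint32_t)(low - 0xDC00u));
}
```

       For example, U+1F600 is encoded as the pair 0xD83D, 0xDE00. In UTF-8
       and UTF-32 the same code point is encoded directly, which is why the
       values 0xD800 to 0xDFFF themselves are invalid there.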

       In some situations, you may already know that your strings  are  valid,
       and  therefore  want  to  skip these checks in order to improve perfor-
       mance, for example in the case of a long subject string that  is  being
       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
       pile time or at match time, PCRE2 assumes that the pattern  or  subject
       it is given (respectively) contains only valid UTF code unit sequences.

       Passing  PCRE2_NO_UTF_CHECK  to pcre2_compile() just disables the check
       for the pattern; it does not also apply to subject strings. If you want
       to  disable the check for a subject string you must pass this option to
       pcre2_match() or pcre2_dfa_match().

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
       result is undefined and your program may crash or loop indefinitely.

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The  string  ends  with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
       characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
       nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
       checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the character do not have the binary value 0b10 (that  is,  either  the
       most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
       long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
       for a value that can be represented by fewer bytes, which  is  invalid.
       For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
       rect coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary  value 0b10 (that is, the most significant bit is 1 and the sec-
       ond is 0). Such a byte can only validly occur as the second  or  subse-
       quent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The  first byte of a character has the value 0xfe or 0xff. These values
       can never occur in a valid UTF-8 string.
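
       The kinds of fault described above can be illustrated by a small
       stand-alone checker. This is a sketch, not PCRE2's own validation
       code: the enum names are hypothetical groupings rather than the
       PCRE2_ERROR_UTF8_ERRn values, and the order in which the tests are
       applied is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical categories; these are NOT the PCRE2 error codes. */
enum utf8_fault {
    UTF8_OK = 0,
    UTF8_TRUNCATED,           /* string ends inside a character */
    UTF8_BAD_CONTINUATION,    /* a trailing byte is not 0b10xxxxxx */
    UTF8_TOO_LONG,            /* 5- or 6-byte form (RFC 2279 only) */
    UTF8_OUT_OF_RANGE,        /* 4-byte value above 0x10ffff */
    UTF8_SURROGATE,           /* 3-byte value in 0xd800..0xdfff */
    UTF8_OVERLONG,            /* value fits in fewer bytes */
    UTF8_STRAY_CONTINUATION,  /* first byte is 0b10xxxxxx */
    UTF8_INVALID_LEAD         /* first byte is 0xfe or 0xff */
};

/* Classify the character at s[0], given `avail` bytes (avail >= 1). On
   success the character length is stored in *used. */
enum utf8_fault utf8_check_char(const unsigned char *s, size_t avail,
                                size_t *used)
{
    /* Smallest value that genuinely needs n bytes, indexed by n. */
    static const uint32_t min_value[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    unsigned char c = s[0];
    size_t n = 2;
    uint32_t value;

    *used = 1;
    if (c < 0x80) return UTF8_OK;
    if (c >= 0xfe) return UTF8_INVALID_LEAD;
    if ((c & 0xc0) == 0x80) return UTF8_STRAY_CONTINUATION;

    /* Count the leading 1 bits of the first byte: length is 2 to 6. */
    while (n < 6 && (c & (0x80 >> n)) != 0) n++;
    if (avail < n) return UTF8_TRUNCATED;

    value = c & (0x7fu >> n);
    for (size_t k = 1; k < n; k++) {
        if ((s[k] & 0xc0) != 0x80) return UTF8_BAD_CONTINUATION;
        value = (value << 6) | (s[k] & 0x3fu);
    }
    if (value < min_value[n]) return UTF8_OVERLONG;
    if (n > 4) return UTF8_TOO_LONG;
    if (value > 0x10ffff) return UTF8_OUT_OF_RANGE;
    if (value >= 0xd800 && value <= 0xdfff) return UTF8_SURROGATE;
    *used = n;
    return UTF8_OK;
}

/* Check a whole string; on failure *bad_offset is the offset of the
   first byte of the offending character. */
enum utf8_fault utf8_check(const unsigned char *s, size_t len,
                           size_t *bad_offset)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; ) {
        size_t used;
        enum utf8_fault f = utf8_check_char(s + i, len - i, &used);
        if (f != UTF8_OK) { *bad_offset = i; return f; }
        i += used;
    }
    return UTF8_OK;
}
```

       With this sketch the two bytes 0xc0, 0xae from the example above are
       reported as overlong, because the decoded value 0x2e is smaller than
       0x80, the smallest value that needs two bytes.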

   Errors in UTF-16 strings

       The following  negative  error  codes  are  given  for  invalid  UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
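
       The three conditions can be illustrated by the following sketch. It
       is not PCRE2's own code: utf16_check() is a hypothetical function, and
       its return values (0 for valid, otherwise 1 to 3 in the order of the
       list above) are labels chosen for this example, not the numeric values
       of the PCRE2 error codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker. Returns 0 if the string is valid UTF-16, else 1,
   2, or 3 as described in the lead-in, with *bad_offset set to the code
   unit offset of the offending surrogate. */
int utf16_check(const uint16_t *s, size_t len, size_t *bad_offset)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];
        int fault = 0;

        if (u < 0xD800u || u > 0xDFFFu)
            continue;                   /* not a surrogate */
        if (u >= 0xDC00u)
            fault = 3;                  /* isolated low surrogate */
        else if (i + 1 == len)
            fault = 1;                  /* high surrogate ends the string */
        else if (s[i + 1] < 0xDC00u || s[i + 1] > 0xDFFFu)
            fault = 2;                  /* high not followed by a low */

        if (fault != 0) {
            *bad_offset = i;
            return fault;
        }
        i++;                            /* skip the low half of the pair */
    }
    return 0;
}
```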


   Errors in UTF-32 strings

       The  following  negative  error  codes  are  given  for  invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
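
       In UTF-32 each code unit is a complete code point, so the check
       reduces to two range tests. A minimal sketch, in which utf32_check()
       and its return values (0 for valid, 1 or 2 in the order of the list
       above) are hypothetical and not part of PCRE2:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker. On failure *bad_offset is the offset of the
   offending code unit. */
int utf32_check(const uint32_t *s, size_t len, size_t *bad_offset)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        int fault = 0;
        if (s[i] >= 0xD800u && s[i] <= 0xDFFFu)
            fault = 1;                  /* surrogate character */
        else if (s[i] > 0x10FFFFu)
            fault = 2;                  /* beyond the Unicode range */
        if (fault != 0) {
            *bad_offset = i;
            return fault;
        }
    }
    return 0;
}
```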


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


