Advanced configuration

Database configuration

GeoNetwork has two options for pooled connections to the relational database:

  1. Manage its own database configuration and pool directly using Apache Commons Database Connection Pool (DBCP)
  2. Use database configuration and pool managed by the web application server (also known as the container) such as tomcat or jetty via the Java Naming and Directory Interface (JNDI).

Managing the database connection pool through Apache DBCP

This option is the one that most users would use as it is the default option for managing the database in GeoNetwork. A typical configuration in the resources element of INSTALL_DIR/web/geonetwork/WEB-INF/config.xml uses the jeeves.resources.dbms.ApacheDBCPool class and looks something like:

<resource enabled="true">
        <name>main-db</name>
        <provider>jeeves.resources.dbms.ApacheDBCPool</provider>
        <config>
                <user>www-data</user>
                <password>www-data</password>
                <driver>org.postgis.DriverWrapper</driver>
                <url>jdbc:postgresql_postGIS://localhost:5432/geonetwork</url>
                <poolSize>10</poolSize>
                <validationQuery>SELECT 1</validationQuery>
        </config>
</resource>

The parameters that can be specified to control the Apache Database Connection Pool are described at http://commons.apache.org/dbcp/configuration.html. You can configure a subset of these parameters in your resource element. The parameters that can be specified are:

Parameter Description Default
maxActive pool size/maximum number of active connections 10
maxIdle maximum number of idle connections maxActive
minIdle minimum number of idle connections 0
maxWait number of milliseconds to wait for a connection to become available 200
validationQuery sql statement for verifying a connection, must return a least one row no default
timeBetweenEvictionRunsMillis time between eviction runs (-1 means next three params are ignored) -1
testWhileIdle test connections when idle false
minEvictableIdleTimeMillis idle time before connection can be evicted 30 x 60 x 1000 msecs
numTestsPerEvictionRun number of connections tested per eviction run 3
defaultTransactionIsolation see http://en.wikipedia.org/wiki/Isolation_%28database_systems%29 READ_COMMITTED

The following parameters are set by GeoNetwork and cannot be configured by the user:

  • removeAbandoned - true
  • removeAbandonedTimeout - 60 x 60 seconds = 1 hour
  • logAbandoned - true
  • testOnBorrow - true
  • defaultReadOnly - false
  • defaultAutoCommit - false
  • initialSize - maxActive

Note: Some firewalls kill idle connections to databases after say 1 hour (= 3600 secs). To keep idle connections alive by testing them with the validationQuery, set minEvictableIdleTimeMillis to something less than timeout, interval (eg. 2 mins = 120 secs = 120000 millisecs), set testWhileIdle to true and set timeBetweenEvictionRunsMillis and numTestsPerEvictionRun high enough to visit connections frequently eg 15 mins = 900 secs = 900000 millisecs and 4 connections per test. For example:

<testWhileIdle>true</testWhileIdle>
<minEvictableIdleTimeMillis>120000</minEvictableIdleTimeMillis>
<timeBetweenEvictionRunsMillis>900000</timeBetweenEvictionRunsMillis>
<numTestsPerEvictionRun>4</numTestsPerEvictionRun>

Note:

  • When GeoNetwork manages the database connection pool, PostGIS database is the only database that can hold the spatial index in the database. All other database choices hold the spatial index as a shapefile. If using PostGIS, two pools of database connections are created. The first is managed and configured using parameters in this section, the second is created by GeoTools and cannot be configured. This approach is now deprecated: if you want to use the database to hold the spatial index you should use the JNDI configuration described in the next section because it uses a single, configurable database pool through GeoTools as well as the more modern NG (Next Generation) GeoTools datastore factories.
  • For more on transaction isolation see http://en.wikipedia.org/wiki/Isolation_%28database_systems%29.

Database connection pool managed by the container

A typical configuration in the resources element of INSTALL_DIR/web/geonetwork/WEB-INF/config.xml uses the jeeves.resources.dbms.JNDIPool class and looks something like:

<resource enabled="true">
        <name>main-db</name>
        <provider>jeeves.resources.dbms.JNDIPool</provider>
        <config>
                <context>java:/comp/env</context>
                <resourceName>jdbc/geonetwork</resourceName>
                <url>jdbc:oracle:thin:@localhost:1521:oracle-XE</url>
                <provideDataStore>true</provideDataStore>
        </config>
</resource>

The configuration parameters and their meanings are as follows:

Config Parameter Description
context The name of the context from which to obtain the resource - almost always this is java:/comp/env
resourceName The name of the resource in the context to use
url The URL of the database - this is needed to let GeoTools know the database type
provideDataStore If set to true then the database will be used for the spatial index, otherwise a shapefile will be used

The remainder of the configuration is done in the container context. eg. for tomcat this configuration is in conf/context.xml in the resource called jdbc/geonetwork. Here is an example for the Oracle database:

<Resource name="jdbc/geonetwork"
        auth="Container"
        type="javax.sql.DataSource"
        username="system"
        password="oracle"
        factory="org.apache.commons.dbcp.BasicDataSourceFactory"
        driverClassName="oracle.jdbc.OracleDriver"
        url="jdbc:oracle:thin:@localhost:1521:XE"
        maxActive="10"
        maxIdle="10"
        removeAbandoned="true"
        removeAbandonedTimeout="3600"
        logAbandoned="true"
        testOnBorrow="true"
        defaultAutoCommit="false"
        validationQuery="SELECT 1 FROM DUAL"
        accessToUnderlyingConnectionAllowed="true"
        poolPreparedStatements="true"
        maxOpenPreparedStatements="400"
/>

The parameters that can be specified to control the Apache Database Connection Pool used by the container are described at http://commons.apache.org/dbcp/configuration.html.

The following parameters should be set to ensure GeoNetwork operates correctly:

  • defaultAutoCommit=”false”
  • accessToUnderlyingConnectionAllowed=”true”

For performance reasons you should set the following parameters:

  • poolPreparedStatements=”true”
  • maxOpenPreparedStatements=”300” (at least)

Notes:

  • both PostGIS and Oracle will build and use a table in the database for the spatialindex if provideDataStore is set to true. Other databases could be made to do the same if a spatialindex table is created - see the definition for the spatialIndex table in INSTALL_DIR/web/geonetwork/WEB-INF/classes/setup/sql/create/create-db-postgis.sql for example.
  • you should install commons-dbcp-1.3.jar and commons-pool-1.5.5.jar in the container class path (eg. common/lib for tomcat5) as the only supported DataSourceFactory in geotools is apache commons dbcp. Naturally you should always use the factory=”org.apache.commons.dbcp.BasicDataSourceFactory” in the JNDI context as well.
  • the default tomcat-dbcp.jar version of apache commons dbcp appears to work correctly for geotools and PostGIS but does not work for those databases that need to unwrap the connection in order to do spatial operations (eg. Oracle).
  • Oracle ojdbc-14.jar or ojdbc5.jar or ojdbc6.jar (depending on the version of Java being used) should also be installed in the container common/lib area (you may need to remove ojdbc-14.jar that appears in INSTALL_DIR/web/geonetwork/WEB-INF/lib.)
  • advanced: you should check the default transaction isolation level for your database driver. READ_COMMITTED appears to be a safe level of isolation to use with GeoNetwork for commonly used databases. Also note that McKoi can only support SERIALIZABLE (does anyone still use McKoi?). For more on transaction isolation see http://en.wikipedia.org/wiki/Isolation_%28database_systems%29.

Advanced configuration for larger catalogs

There are a number of steps you must consider if you are going to use GeoNetwork for catalogs with 20,000 or more metadata records:

  1. Consider the hardware you have available GeoNetwork uses a database as a transactional store and does all metadata searches using Lucene. Lucene is very fast and scale for large catalogs if you supply fast disk (solid state disk is best by far), lots of memory/RAM (16Gb+) and multiple processors as part of a 64bit environment. Linux is probably the best operating system to take advantage of such an environment.

  2. Build the spatial index into your database ie. Use PostGIS (Postgres+PostGIS) or Oracle as your database GeoNetwork has to build a spatial index containing all metadata bounding boxes and polygons, in order to support spatial queries for the Catalog Services Web (CSW) interface eg. select all metadata records that intersect a search polygon. By default GeoNetwork uses a shapefile but the shapefile quickly becomes costly to maintain during reindexing usually after the number of records in the catalog exceeds 20,000. If you select PostGIS or Oracle as your database via JNDI (see previous section), GeoNetwork will build the spatial index in a table (called spatialindex). The spatialindex table in the database is much faster to reindex. But more importantly, if appropriate database hardware and configuration steps are taken, it should also be faster to query than the shapefile when the number of records in the catalog becomes very large.

  3. Consider the Java heap space Typically as much memory as you can give GeoNetwork is the answer here. If you have a 32bit machine then you are stuck below 2Gb (or maybe a little higher with some hacks). A 64bit machine is best for large catalogs. Jetty users can set the Java heap space in INSTALL_DIR/bin/start-geonetwork.sh (see the -Xmx option: eg. -Xmx4g will set the heap space to 4Gb on a 64bit machine). Tomcat users can set an environment variable JAVA_OPTS eg. export JAVA_OPTS=”-Xmx4g”

  4. Consider the number of processors you wish to allocate to GeoNetwork GeoNetwork 2.8 allows you to use more than one system processor (or core) to speed up reindexing and batch operations on large numbers of metadata records. The records to be processed are split into groups with each group assigned to an execution thread. You can specify how many threads can be used in the system configuration menu. A reasonable value for the number of threads is the number of processors or cores you have allocated to the GeoNetwork Java Virtual Machine (JVM) or just the number of processors on the machine that you have dedicated to GeoNetwork.

  5. Consider the number of database connections to be allocated to GeoNetwork GeoNetwork uses and reuses a pool of database connections. This is configured in INSTALL_DIR/web/geonetwork/WEB-INF/config.xml or in the container via JNDI. To arrive at a reasonable number for the pool size is not straight forward. You need to consider the number of concurrent harvesters you will run, the number of concurrent batch import and batch operations you expect to run and the number of concurrent users you are expecting to arrive. The default value of 10 is really only for small sites. The more connections you can allocate, the less time your users and other tasks will spend waiting for a free connection.

  6. Consider the maximum number of files your system will allow any process to have open Most operating systems will only allow a process to open a limited number of files. If you are expecting a large number of records to be in your catalog then you should change the default value to something larger (eg. 4096) as the lucene index in GeoNetwork will occasionally require large numbers of open files during reindexing. In Linux this value can be changed using the ulimit command (ulimit -a typically shows you the current setting). Find a value that suits your needs and add the appropriate ulimit command (eg. ulimit -n 4096) to the GeoNetwork startup script to make sure that the new limit is used when GeoNetwork is started.

  7. Raise the stack size limit for the postgres database Each process has some memory allocated as a stack. The stack is used to store process arguments and variables as well as state when functions are called. Most operating systems limit the size that the stack can grow to. With large catalogs and spatial searches, very large SQL queries can be generated on the PostGIS spatial index table. This can cause postgres to exceed the process stack size limit (typically 8192k on smaller machines). You will know when this happens because a very long SQL query will be output to the GeoNetwork log file prefixed with a cryptic message something along the lines of:

    java.util.NoSuchElementException: Could not acquire
    feature:org.geotools.data.DataSourceException: Error Performing SQL query: SELECT .........

    In Linux the stack size can be changed using the ulimit command (ulimit -a typically shows you the current setting). You will need to choose a value and set it (eg. ulimit -s 262140) in the shell startup script of the postgres user (eg. .bashrc if using the bash shell). The setting may also need to be added to the postgres config - see “max_stack_depth” in the postgresql.conf file for your system. You may also have to enable to postgres user to change the stack size in /etc/security/limits.conf. After this has been done, restart postgres.

  8. If you need to support a catalog with more than 1 million records GeoNetwork creates a directory for each record that in turn contains a public and a private directory for holding attached data and thumbnails. These directories are in the GeoNetwork data directory - typically: INSTALL_DIR/web/geonetwork/WEB-INF/data. This can exhaust the number of inodes available in a Linux file system (you will often see misleading error reports saying that the filesystem is ‘out of space’ - even though the filesystem may have lots of freespace). Check this using df -i. Since inodes are allocated statically when the filesystem is created for most common filesystems (including extfs4), it is rather inconvenient to have to backup all your data and recreate the filesystem! So if you are planning a large catalog with over 1 million records, make sure that you create a filesystem on your machine with the number of inodes set to at least 5x (and to be safe 10x) the number of records you are expecting to hold and let GeoNetwork create its data directory on that filesystem.