The Safe Software Blog
Author:
Mark Ireland

Google

Get the Blog Newsletter

Delivered by FeedBurner

About FME    |   May 13, 2011   |   By Mark Ireland

FME2011 Use Case: Joiner vs FeatureMerger

Hi FME’ers,
Interacting with FME users I see various points of view on the merits of the Joiner transformer versus the FeatureMerger. Although both transformers carry out similar actions, it’s not clear to users when you should use each of these – particularly in relation to workspace performance.

So this post will indulge in some investigative journalism! I’ll compare and contrast these transformers, to see where you would want to use each of them, and throw other transformers – such as the updated SQLExecutor and the new FeatureReader - into the mix as potential alternatives.

Descriptions
First a description. In general these transformers are used to merge features together. For example, you have a Shape dataset of address points, a database of related records, and want to join the two sets of information into one. That’s when you would use one of these transformers.

To do the merge requires some common information; usually a common ID number.

FeatureMerger
The FeatureMerger is for when both sets of data are being read in a workspace. It takes input into two ports – Requestor and Supplier – and merges information from each supplier onto the requestor with the same ID number.

Joiner
The Joiner has only a single input port. The port is called ‘INPUT’ but you might think of it as the ‘requester’. It requests information not from other FME features, but directly from records in a database; again using a common ID number to carry out the match.

FeatureReader + SpatialFilter
The FeatureReader does a similar job to the Joiner, but allows matching of features by spatial relationship rather than a common attribute value. The SpatialFilter would do a similar job where both sets of features are already read into the workspace, making it the spatial equivalent of the FeatureMerger.

SQLExecutor
The SQLExecutor carries out a SQL command on a database – that command could be a select statement with a where clause, effectively allowing joins to be carried out.

Performance Tests
The test data originally consisted of 12,292 Address Records, but these were cloned 100 times to give a bigger dataset (1,229,200 features) and split into spatial (point features) and non-spatial (text records/attributes). A common ID (AddressID) identified the relationship between the two data types.

For format, the spatial data was written to Shape, but the non-spatial to both CSV and SQLServer.


Test 1
Test 1 was a simple test to recombine the CSV attributes back with the Shape spatial features. The FeatureReader and SQLExecutor aren’t valid here because they only do a match with a true database format.

Time

Max Memory

Joiner

5′ 41″

299,692kb

FeatureMerger

12′ 21″

873,732kb

So here the Joiner wins out: it’s quicker and uses far less memory. It basically doesn’t need to read every feature into memory as the FeatureMerger does. Without further blocking (group-based) transformers in the workspace, the data can start to be written while joins are still going on!


Test 2
Test 2 is the same simple recombine as test 1, but using a SQL Server database rather than a CSV file. As a proper database, the FeatureReader and SQLExecutor are now valid options.

Time

Max Memory

Joiner

21′ 40″

214,160kb

FeatureMerger

13′ 30″

881,160kb

FeatureReader

1hr 14′ 21″

45,892kb

SQLExecutor

1 hr 55′ 56″

45,860kb

This time the FeatureMerger wins out in terms of speed. I suspect that’s because the Joiner, FeatureReader, and SQLExecutor are having to hit the database with every read, rather than just once for the FeatureMerger. So they are low on memory use, but slow because of waiting for multiple database responses. To test that theory there is a setting in the Joiner to read the entire table contents into a cache at the start of the process.

When that is set the result comes back as follows:

Time

Max Memory

Joiner

11′ 09″

872,764kb

FeatureMerger

13′ 30″

881,160kb

So there too the Joiner is the better option. Notice that the memory usage has shot up to FeatureMerger levels – because the Joiner is now also reading all data at the start – but that it still runs faster.


Test 3
In this test the spatial features will be renumbered. Originally they were 1-1,229,200 but now I’ll group them into 100 batches where each batch’s features are numbered 1-12,292.

It’s a bit contrived, but simply means each supplier will need to be used multiple times. You could – for example – think of it as an address being matched to a zip/postal code (where many addresses have the same code).

No pre-fetch was set for the Joiner; just a cache size of 10,000.

The results were as follows:

Time

Max Memory

Joiner

8′ 37″

80,776kb

FeatureMerger

11′ 32″

878,680kb

The main reason for the performace benefit of the Joiner is that it doesn’t need to read every single record. Here only the first 12,292 records could possibly be a match: the Joiner can read individual records but the FeatureMerger has to read all of the table regardless of the fact that 90% of it would never be used.

But without a pre-fetch, wouldn’t the Joiner have to keep hitting the database? Well, no. What happened is the Joiner filled a cache with up to 10,000 records by itself as it retrieved data. So when the same record needed to be matched a second time, FME already had possession of it. The log proves that 80% of the time the Joiner didn’t need to go to the database for the required info:

@Relate: Database query statistics for table `Joiner:dbo.ADDRESSINFO’: 1229200 queries made of which 0 were sequential duplicates and 990099 hit the record cache of 10000 records (80% overall cache hit)

So, even though the FeatureMerger already had all the features, the Joiner managed to keep up by caching re-usable records.

Conclusions
The use cases are fairly clear now:

  • If you’re going to be joining the entire attribute table, and there aren’t enough records to exhaust system memory, then it doesn’t matter too much which transformer you use. But if you do use the Joiner, and it’s on a spatial database, be sure to set a large cache pre-fetch.
  • If you’re going to be joining the entire attribute table, and there are too many records to comfortably read into memory, then the Joiner is the better option. It might start out a little slower, but it will get there without causing a ton of disk thrashing or using up too many system resources.
  • If you’re going to be joining only part of the attribute table, then the Joiner is again the better option. It doesn’t need to read the entire table when only part of the data is required.

What might not be clear is how to tell when memory might become an issue. Well, the space used by the data table would be a big clue, especially when combined with the size of the spatial data files. If they are getting on to 4GB then I would think it’s pushing the limit for a single read and that the Joiner should be preferred.

Also, sometimes the use of a particular transformer is dictated by it being a special case.

Special Cases
Yes, there are times when you would absolutely need to use one particular transformer over the others.

Spatial Relationship
If you want to merge features on the basis of a spatial relationship, then only the FeatureReader (or SpatialFilter) will do the job. The others work on ID only.


Unused References
When you want to find which supplier features were/weren’t used in the join process, then the FeatureMerger is the transformer to use. The others will tell you which requestors found a match, but not which suppliers did.


Transformation Results
It’s a bit hard to describe, but the FeatureMerger is of great use when you want to transform some data, but merge the results back onto the original geometry.

For example, here a user wishes to check for self-intersections, but not actually break the data at those points. The trick is to do the self-intersection, but then merge the resulting _segments attribute back onto the original geometry with a FeatureMerger.


Multiple Keys
When there are multiple ID matches that need to be made (i.e. there are multiple requester fields that must match multiple supplier fields) then the Joiner transformer is the one to use. The FeatureMerger only works on a single ID field (unless you can concatenate multiple fields). The FeatureReader would work (using a WHERE clause) but wouldn’t really be worth doing unless there was a spatial join aspect as well.


Multiple Tables
Another scenario involves multiple tables; basically a requestor and supplier but where the join is made through a lookup table between the two.

For example, I have a dataset of water pipes and want to identify who manufactured the pipe. Each pipe has a MaterialID field to identify the material type in a PipeMaterials table. In turn the PipeMaterials table has a ManufacturerID field to identify the manufacturer in a MaterialsManufacturer table.

To get the name of the pipe manufacturer would need two Joiner transformers at least, or readers to read three full tables for the FeatureMerger.

But the SQLExecutor transformer is ideal here because the join can be built into a single query, for example:

select * from MaterialManufacturer where MateralManufacturer.ManufacturerID = (select ManufacturerID from PipeMaterials where PipeMaterials.MaterialID = @Value(PipeMatID))


Relationship Integrity
When the integrity of the relationship needs to be tested, then the Joiner is again the transformer to use. It will let you test whether the requester/supplier match is 1:1, 1:M, or various combinations – and throw an error if this relationship is breached. The FeatureMerger has fewer options and will not error if a problem exists.

Caution!
The Joiner will suffer badly if data is not indexed on the search key. For example, one test query took 30 minutes without an index, versus 2 minutes when the key field was indexed. You’ll notice in the log file when this happens because the time listed for “CPU” will be way less than the total time.

The FeatureMerger doesn’t care about an index, because it has to read everything. So it might appear to be faster, but if so the Joiner/database probably isn’t tuned correctly.

Also, the FeatureReader isn’t as bad as it appears here. The FeatureReader is to the Joiner as the SpatialFilter is to the FeatureMerger, so when data needs to be joined by a spatial relationship then that transformer should be the one to use, and there are an absolute mountain of use cases that benefit from it.

Similarly, the SQLExecutor has many other uses besides joining data, that the Joiner and FeatureMerger just couldn’t achieve.

I hope this is a useful guide to the use of these two transformers.


Webinar: 7 Spatial Database Tips for PostGIS, Oracle, and SQL Server
August 26 - Become a database dynamo with these productivity-boosting tips