A Novel Approach to Entity-Resolution using Serverless Technology

Estimated reading time: 14min

The "Transitive Hop" problem

As the engineering team at Regis24, a German consumer credit bureau, we faced a growing technical problem - how to handle hundreds of millions of data sets about tens of millions of people and connect the data in a way that is scalable, fast to search and cost-efficient.

Our typical use case is as follows: we may have a number of data sets that show John Smith lived in Hamburg, and a number of data sets that show he moved to Munich, and then lastly data sets that show he moved to Berlin. This is a fairly trivial example, but you can imagine that in theory an individual could have moved from A to B to C… to Z and beyond.

This data connection is referred to as “entity resolution”, or “record linkage” and has utility in financial fraud detection, anti-money laundering, security, advertising technology, real-time product recommendation and many other fields. Of course, the data being linked does not always have to refer to people - it could be products, proteins, car parts etc.

The functional requirement at Regis24 was that one can search with John’s Hamburg information and retrieve the Berlin result as quickly as the Hamburg result. Or, as a more extreme example, search with the data of A and find the Z result as fast as the B result.

The problem here is two-fold. First, each “node” (i.e. John in Hamburg) can have any amount of similar, non-identical data. Searching all of this data just to get the result “Hamburg” takes up too much time. Second, to reach the result “Berlin”, one would first have to go through the result “Munich”, meaning that the more “nodes” that we “hop”, the longer our searches take to complete. We call this the “transitive hop” problem.

Forced into development

A credit bureau’s job is to sell data, not to build database technologies, so naturally we tried many existing solutions on the market to address the above requirements. First, we tried Apache Spark’s MapReduce and Connected Components - an approach where all the entities get recalculated on a regular basis. However, being a resource-hungry batch process, it requires a lot of hardware, and its results are by definition permanently outdated.

Next, we started to explore the world of graph databases. Like many who want to dip their toes into graph databases, we started with Neo4j. This was incredibly slow to search our data and would have been prohibitively expensive. We also tried AWS’s graph database, Neptune, but this delivered similarly bad performance. 

Serverless enters, stage right

After realising that we were unlikely to find any existing technology to satisfy our requirements, we reluctantly concluded we would have to go back to the drawing board and try to build something ourselves. But how could we build something and succeed where everyone else so far had failed?

The answer came thanks to the emergence of serverless technology. We realised that it might be possible to develop a novel entity-resolution database technology, starting from scratch, by taking advantage of the possibilities opened up by serverless technology.

Three years ago (approx. 2018) we started researching and developing TiloDB, and about a year ago we began to put TiloDB into production. Today, TiloDB is the core data infrastructure for Regis24, connecting 180 million data sets about 45 million persons (entities).

(Fun fact: TiloDB* is named after Tilo, who is the three-year-old son of Stefan, the lead engineer behind the technology. Tilo is a pretty cool kid.)

TiloDB's advantages

Taking advantage of existing serverless technologies, mainly Lambda, S3, DynamoDB and Kinesis, allowed us to focus on the really challenging parts of entity resolution. Additionally, we implemented the solution in Golang, which helped us build the parts where parallelisation was required.

Arguably, TiloDB’s most impressive metric is its search response time, which is a near constant ~150ms no matter how many “nodes” have to be traversed. We seem to have indeed solved the “transitive hop” problem.

Addition of data into an entity takes <500ms, and immediately after data is added or removed a specific event (e.g. entity created, entities merged) is triggered, which may be used for alerting or other actions.

Thanks to running on AWS serverless technology, TiloDB is also very cheap to run. We can predict the price per insert or search, and thanks to AWS scale, the more TiloDB is used, the cheaper it gets to run. Compare this to server-based technology, where you can throw more servers at a problem to try and solve it, but you pay for that with increased server costs. TiloDB scales up and down, thus keeping costs to a minimum.


Interactive Demo

Click through the walkthrough to see how to use this interactive demo of TiloDB, then you can play with TiloDB yourself. You will be able to see other people submitting data to TiloDB in real-time. 

Tutorial:
  1. Submit random data.
  2. Connect with more data.
  3. Search for an entity.
  4. Enable all demo features.

Releasing TiloDB as Open Source Software

We are currently in the process of spinning the TiloDB technology out into a new company to focus on its development.

The first thing we want to do is release TiloDB as open source software (OSS). We believe that there are lots of use cases for TiloDB that we can't possibly imagine and we would love to build a community around the tech that is as passionate about it as we are.

If you would like to be notified when we release TiloDB as OSS, please leave your email address below. We will only contact you about TiloDB's OSS release.


How TiloDB works

The API is provided using GraphQL via AWS API Gateway and Lambda. Adding data into the database works by calling the submit mutation, which in a very simplified way looks like:

mutation {
  submit(
    new: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Augustinerstr."
        houseNumber: "1"
        city: "München"
      }
    }
    old: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Jungfernstieg"
        houseNumber: "7"
        city: "Hamburg"
      }
    }
  )
}

The example represents the submission of John’s current address as well as one of his previous addresses. A submission might include only the currently known address and no previous addresses. TiloDB makes the connections with existing data in the database using the pre-configured rules, regardless of how much data is submitted.

The invoked GraphQL Lambda then takes this data and puts it into a Kinesis stream, which serves as a buffer for unexpectedly high loads, but above all ensures that the data will eventually be added to the database. This is important not only when downstream services are unavailable, but also when it comes to concurrent writes to the same entity.
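For illustration, handing a submission over to Kinesis with the AWS SDK for Go could look like the sketch below. The stream name and the partition key are assumptions, not TiloDB's actual configuration; a partition key derived from the submitted data keeps writes that may touch the same entity ordered within one shard.

package main

import (
	"encoding/json"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

func main() {
	svc := kinesis.New(session.Must(session.NewSession()))

	// The (simplified) submit payload coming from the GraphQL Lambda.
	payload, _ := json.Marshal(map[string]interface{}{
		"new": map[string]interface{}{"firstName": "John", "surName": "Smith"},
	})

	// Stream name and partition key are illustrative only.
	_, err := svc.PutRecord(&kinesis.PutRecordInput{
		StreamName:   aws.String("tilodb-submissions"),
		PartitionKey: aws.String("john:smith"),
		Data:         payload,
	})
	if err != nil {
		panic(err)
	}
}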

The next Lambda function, the assembly, is triggered as soon as an item is put into the Kinesis stream. This is where the data gets added into the database, and it can roughly be separated into the following steps:

  • matching
  • deduplication
  • updating
  • indexing
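As a rough sketch of the handler shape only (all names and the payload layout are assumptions; the four steps themselves are elaborated in the sections below), an assembly Lambda consuming the Kinesis stream with aws-lambda-go could be structured like this:

package main

import (
	"context"
	"encoding/json"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// Submission mirrors the simplified submit payload.
type Submission struct {
	New map[string]interface{} `json:"new"`
	Old map[string]interface{} `json:"old,omitempty"`
}

// handler is invoked for each batch of Kinesis records.
func handler(ctx context.Context, evt events.KinesisEvent) error {
	for _, rec := range evt.Records {
		var sub Submission
		if err := json.Unmarshal(rec.Kinesis.Data, &sub); err != nil {
			// Returning an error makes the Lambda service redeliver the batch.
			return err
		}
		// 1. matching      - build index keys and look up candidate data sets
		// 2. deduplication - collapse non-identical duplicates before creating edges
		// 3. updating      - create, extend or merge the entity object in S3/DynamoDB
		// 4. indexing      - store the new index keys for future lookups
		_ = sub
	}
	return nil
}

func main() { lambda.Start(handler) }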

Indexing

An entity resolution technology must compare different data sets with other potential matches in order to identify the candidates that belong to one particular entity. Comparing data sets is also relevant when performing a search and when identifying possible duplicates. The goal is always to have as few comparisons between data sets as possible. Ideally, each resulting data “block” contains only the correct results. However, for fuzzy matching, e.g. when using the Levenshtein distance, this is not possible.

Consider the following two rules that could be used for matching:

  • R1: matching if: first name, surname and city are lowercase equal, OR
  • R2: matching if: city and street are lowercase equal and the name parts have the same metaphone value and a maximum Levenshtein distance of 1

Creating an index key for R1 is simple. For the old and the new (current) data of the previous submission, the resulting index keys can be calculated as:

  • R1:john:smith:münchen
  • R1:john:smith:hamburg

Any data sets sharing the same index key are considered a match without any further comparison.

For R2 the resulting index key cannot encode the Levenshtein distance, as that requires another data set to compare with. However, if we ignore the Levenshtein distance for a moment, then the resulting index keys would be:

  • R2:münchen:augustinerstr.:JN:SM0
  • R2:hamburg:jungfernstieg:JN:SM0

Where JN is the metaphone value for John and SM0 the one for Smith. Everything with the same index key is a potential match.
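As a minimal sketch, deriving these index keys in Golang could look as follows. The metaphone encoder is passed in as a function, because the concrete phonetic library TiloDB uses is not part of this description:

package main

import (
	"fmt"
	"strings"
)

// r1Key: first name, surname and city, lowercase.
func r1Key(firstName, surName, city string) string {
	return "R1:" + strings.ToLower(firstName+":"+surName+":"+city)
}

// r2Key: city and street, lowercase, plus the metaphone values of the name parts.
// Any metaphone implementation can stand in for the encoder.
func r2Key(firstName, surName, street, city string, metaphone func(string) string) string {
	return fmt.Sprintf("R2:%s:%s:%s:%s",
		strings.ToLower(city),
		strings.ToLower(street),
		metaphone(firstName),
		metaphone(surName))
}

func main() {
	fmt.Println(r1Key("John", "Smith", "München"))
	// Output: R1:john:smith:münchen
}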

The indexes are then stored in an AWS DynamoDB table, together with some additional data.

Assuming for the new data an ID aaa and for the old data an ID bbb was generated - the actual implementation uses random UUIDs - then the index table would contain the following values:

index_key                           data
R1:john:smith:münchen               {"aaa"}
R1:john:smith:hamburg               {"bbb"}
R2:münchen:augustinerstr.:JN:SM0    {"aaa:john:smith"}
R2:hamburg:jungfernstieg:JN:SM0     {"bbb:john:smith"}

The data column is a string set that contains at least the ID of the corresponding data set(s). For R2 it also contains the first name and surname, which are used for the final comparison. This has the advantage of not needing to load the actual data from somewhere else.

Furthermore, this table structure has the advantage that data can be added very easily. This holds both for a completely new index key and for adding data to an already existing one. Both cases can be handled with the same DynamoDB update query:

  • Key: {"index_key": {S: "R1:john:smith:münchen"}}
  • UpdateExpression: "ADD data :data"
  • ExpressionAttributes: {":data": {SS: ["aaa"]}}

If the item already exists, then aaa is added to the string set; otherwise a new index entry is created under the provided key. The remaining steps all build on this indexing mechanism, which is why it has been described first.
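With the AWS SDK for Go, this upsert could look like the following sketch. The table name is illustrative, and since data is a reserved word in DynamoDB, an expression attribute name is used for the attribute:

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	svc := dynamodb.New(session.Must(session.NewSession()))

	// Creates the index item if it is missing, otherwise adds "aaa"
	// to the existing string set.
	_, err := svc.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String("index_table"),
		Key: map[string]*dynamodb.AttributeValue{
			"index_key": {S: aws.String("R1:john:smith:münchen")},
		},
		UpdateExpression:         aws.String("ADD #d :data"),
		ExpressionAttributeNames: map[string]*string{"#d": aws.String("data")},
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":data": {SS: []*string{aws.String("aaa")}},
		},
	})
	if err != nil {
		panic(err)
	}
}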


Matching

Assuming the two previous data sets have already been indexed and now the following data set ccc needs to be added:

mutation {
  submit(
    new: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Hofgraben"
        houseNumber: "3a"
        city: "München"
      }
    }
  )
}

When searching for possible matches, the index keys according to R1 and R2 would be:

  • R1:john:smith:münchen
  • R2:münchen:hofgraben:JN:SM0

Searching the index table with these keys results in exactly one entry: {"aaa"}, currently containing only the data set aaa. Because rule R1 does not require any further checks, the matching is already complete: the new data set needs to be added to the existing entity.

For another data set ddd

mutation {
  submit(
    new: {
      firstName: "Johnn"
      surName: "Smith"
      address: {
        street: "Augustinerstr."
        houseNumber: "11"
        city: "München"
      }
    }
  )
}

the resulting index keys would be

  • R1:johnn:smith:münchen
  • R2:münchen:augustinerstr.:JN:SM0

resulting in a possible match for R2 on {"bbb:john:smith"}. Since everything required for the additional Levenshtein check is already present, no further data needs to be loaded. The Levenshtein distance between John and Johnn is 1, and between Smith and Smith it is 0, so this data set should also be added to the existing entity.
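As a small sketch of that final comparison for R2 candidates, a textbook Levenshtein implementation is enough (TiloDB's own comparison code is not shown here):

package main

import "fmt"

// levenshtein returns the edit distance between two strings.
func levenshtein(a, b string) int {
	ar, br := []rune(a), []rune(b)
	prev := make([]int, len(br)+1)
	curr := make([]int, len(br)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ar); i++ {
		curr[0] = i
		for j := 1; j <= len(br); j++ {
			cost := 1
			if ar[i-1] == br[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(br)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	// Candidate bbb:john:smith vs. the new data set "Johnn Smith":
	// both name parts must stay within a Levenshtein distance of 1.
	fmt.Println(levenshtein("john", "johnn"))  // 1
	fmt.Println(levenshtein("smith", "smith")) // 0
}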

An entity can be represented as a graph comprising nodes (data sets) and edges (matches). There are two types of edges so far: rule-based and fact-based. Every edge found during matching is rule-based; everything provided from the client side is a fact-based edge, e.g. the client may have provided the information that aaa and bbb represent a relocation.

This results in the following edges so far:

  • aaa:bbb:RELOCATION
  • aaa:ccc:R1
  • bbb:ddd:R2


Deduplication

Let's look at data set eee

mutation {
  submit(
    new: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Hofgraben"
        houseNumber: "3"
        city: "München"
      }
    }
  )
}

which is similar, but not identical (see house number), to ccc - a non-identical duplicate. This is quite typical: a different order transaction, a different date or anything else that prevents one from simply ignoring the data set.

When trying to match the data set in the same way, it would result in these new edges:

  • aaa:eee:R1
  • ccc:eee:R1
  • ccc:eee:R2

This may look fine at first, but what if there were just four non-identical duplicates?

That is already twelve edges, or n*(n-1)/2 per rule. For a frequent online shopper with only 100 orders, this would be 9,900 edges!

To prevent this quadratic growth of the edges, a new rule can be added:

D1: matching if: first name, surname, city and street are lowercase equal

The deduplication rule must cover all the matching-relevant fields from R1 and R2. The assumption here is that these fields do not change as frequently across data sets as, for example, an order number does.

Finding duplicates using D1 works in the same way as finding matches: generate the index key, search with it, and filter the candidates when fuzzy matching is required.

Data set eee would then be recognized as a duplicate of ccc. It might simply be stored as a single edge in the form ccc:eee:D1 or slightly optimized as in:

{
  "ccc": ["eee"]
}

The important part is that duplicates do not create edges and are not indexed, because they provide no new data relevant for matching or searching.

When the original of a duplicate is removed, one of its duplicates becomes the new original, and all indexes and edges must be rewritten with the new ID.

For the frequent shopper example above, this reduces the number of edges from 9,900 to 99, resulting in linear rather than quadratic growth.


Updating

When adding a new data set, there are three potential scenarios:

  • no matching data sets were found -> a new entity must be created
  • one or more matching data sets for the same entity were found -> the entity needs to be updated
  • matching data sets from multiple entities were found -> the entities must be merged into one entity

For this use case, each entity is stored in a single AWS S3 object. Each object uses JSON lines (jsonl). The first line contains a header and all other lines contain one data set each.

The content for the example entity looks something like:

{"entityID":"157009fd-65f3-4ec6-ac29-7f9f81cafa72","edges":["aaa:bbb:RELOCATION","aaa:ccc:R1","bbb:ddd:R2"],"duplicates":{"ccc":["eee"]}}
{"id":"aaa","firstName":"John","surName":"Smith","address":{"street":"Augustinerstr.","houseNumber":"1","city":"München"}}
{"id":"bbb","firstName":"John","surName":"Smith","address":{"street":"Jungfernstieg","houseNumber":"7","city":"Hamburg"}}
{"id":"ccc","firstName":"John","surName":"Smith","address":{"street":"Hofgraben","houseNumber":"3a","city":"München"}}
{"id":"ddd","firstName":"Johnn","surName":"Smith","address":{"street":"Augustinerstr.","houseNumber":"11","city":"München"}}
{"id":"eee","firstName":"John","surName":"Smith","address":{"street":"Hofgraben","houseNumber":"3","city":"München"}}

In addition to this file, two AWS DynamoDB tables are populated. The entity table:

entity_id                              s3_location                                                                       ...
157009fd-65f3-4ec6-ac29-7f9f81cafa72   1570/157009fd-65f3-4ec6-ac29-7f9f81cafa72_10ff295f-8f4a-41d3-b35f-2d556b7cb2d1   ...

This table holds the path to the S3 location and other metadata.

The data set table:

data_set_id   entity_id                              ...
aaa           157009fd-65f3-4ec6-ac29-7f9f81cafa72   ...
bbb           157009fd-65f3-4ec6-ac29-7f9f81cafa72   ...
ccc           157009fd-65f3-4ec6-ac29-7f9f81cafa72   ...
ddd           157009fd-65f3-4ec6-ac29-7f9f81cafa72   ...
eee           157009fd-65f3-4ec6-ac29-7f9f81cafa72   ...

When adding a data set for a new entity, the S3 object is created and both the entity table and the data set table are populated.

When adding a data set to an existing entity, a new S3 object is created (adding the header, copying the existing data sets and appending the new data set), a new entry in the data set table is created, the location in the entity table is updated and the old S3 object is removed.

When adding a data set that merges two entities, the surviving entity is selected, a new S3 object is created (adding the header, copying the existing data sets from both old objects and appending the new data), a new entry in the data set table is created (pointing to the survivor), the location for the surviving entity is updated and both old S3 objects are removed.
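To illustrate the copy-on-write nature of these updates, here is a rough sketch of the second case (extending an existing entity). Bucket name, key layout and error handling are simplified assumptions, and the entity table and data set table updates are left out:

package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/google/uuid"
)

// appendDataSet copies the existing JSON-lines object, appends the new data
// set line, writes the result under a fresh key and removes the old object.
func appendDataSet(svc *s3.S3, bucket, oldKey, entityID, newLine string) (string, error) {
	old, err := svc.GetObject(&s3.GetObjectInput{Bucket: aws.String(bucket), Key: aws.String(oldKey)})
	if err != nil {
		return "", err
	}
	defer old.Body.Close()
	body, err := io.ReadAll(old.Body)
	if err != nil {
		return "", err
	}
	// Key layout as in the entity table example: <prefix>/<entityID>_<random>.
	newKey := fmt.Sprintf("%s/%s_%s", entityID[:4], entityID, uuid.New().String())
	if _, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(newKey),
		Body:   bytes.NewReader(append(body, []byte(newLine+"\n")...)),
	}); err != nil {
		return "", err
	}
	// In the real flow the entity table is pointed at newKey before the old
	// object is deleted.
	_, err = svc.DeleteObject(&s3.DeleteObjectInput{Bucket: aws.String(bucket), Key: aws.String(oldKey)})
	return newKey, err
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	_, _ = appendDataSet(svc, "tilodb-entities", "1570/157009fd-old-object",
		"157009fd-65f3-4ec6-ac29-7f9f81cafa72",
		`{"id":"fff","firstName":"John","surName":"Smith"}`)
}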


Searching

Searching works similarly to finding matches or duplicates:

  • create index keys for the provided search
  • find possible candidates and filter them
  • find the entries in the data set table for these candidates
  • find the S3 location(s) from the entity table
  • load the S3 object(s)
  • perform any kind of data selection or aggregation on the entity data sets

It’s worth highlighting the advantage of this approach: the linear time complexity, already better than the quadratic complexity of standard blocking techniques, is in practice very close to a constant response time, no matter how complex the entities are. Using our example from the beginning, searching with A returns the result Z as fast as it returns the result B.

Since the recognized duplicates have not been indexed, they don't show up as possible candidates. The database lookups and S3 GetObject calls for the few remaining candidates can easily be parallelized - straightforward, since the whole thing is implemented in Golang. Processing, e.g. aggregating certain fields from an entity's data sets, can also start as soon as the second line of the S3 object has been downloaded. This part, of course, cannot take less than O(n).
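As a sketch of that parallel fetching in Golang (bucket name and keys are illustrative; streaming the lines as they arrive is omitted here):

package main

import (
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"golang.org/x/sync/errgroup"
)

// fetchEntities loads the S3 objects of all candidate entities in parallel.
func fetchEntities(svc *s3.S3, bucket string, keys []string) ([][]byte, error) {
	bodies := make([][]byte, len(keys))
	var g errgroup.Group
	for i, key := range keys {
		i, key := i, key // capture loop variables (pre-Go 1.22)
		g.Go(func() error {
			out, err := svc.GetObject(&s3.GetObjectInput{
				Bucket: aws.String(bucket),
				Key:    aws.String(key),
			})
			if err != nil {
				return err
			}
			defer out.Body.Close()
			bodies[i], err = io.ReadAll(out.Body)
			return err
		})
	}
	return bodies, g.Wait()
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	_, _ = fetchEntities(svc, "tilodb-entities", []string{
		"1570/157009fd-65f3-4ec6-ac29-7f9f81cafa72_10ff295f-8f4a-41d3-b35f-2d556b7cb2d1",
	})
}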


Further Notes

Some parts were not yet covered, but they are important for the overall process.

Parallelization and Locking

Data may be added in parallel. Each entity must therefore be locked to prevent conflicting writes on the same entity (entity table or S3 object). This is done using an UpdateItem request instead of a GetItem request when loading the data from the entity table:

  • ConditionExpression: "attribute_not_exists(#lockConstraint) OR #lockConstraint=:lockConstraint"
  • UpdateExpression: "SET #lockConstraint=:lockConstraint"
  • ReturnValues: "ALL_NEW"

However, if multiple entities are involved in an update process, it must be ensured that all locks have been successfully acquired. If that is not the case, then all locks must be released again to prevent deadlocks.
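A sketch of this combined load-and-lock call with the AWS SDK for Go; the table name and the attribute used to store the lock are assumptions, while the expressions follow the description above:

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
	"github.com/google/uuid"
)

// lockEntity loads an entity row and acquires its lock in a single request.
// The condition succeeds if no lock is set, or if this invocation already
// holds it (the lock constraint is a value unique to the invocation).
func lockEntity(svc *dynamodb.DynamoDB, entityID, lockConstraint string) (map[string]*dynamodb.AttributeValue, error) {
	out, err := svc.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String("entity_table"),
		Key: map[string]*dynamodb.AttributeValue{
			"entity_id": {S: aws.String(entityID)},
		},
		ConditionExpression:      aws.String("attribute_not_exists(#lockConstraint) OR #lockConstraint = :lockConstraint"),
		UpdateExpression:         aws.String("SET #lockConstraint = :lockConstraint"),
		ExpressionAttributeNames: map[string]*string{"#lockConstraint": aws.String("lock_constraint")},
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":lockConstraint": {S: aws.String(lockConstraint)},
		},
		ReturnValues: aws.String("ALL_NEW"),
	})
	if err != nil {
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
			// Another invocation holds the lock; fail and let Kinesis retry later.
			return nil, aerr
		}
		return nil, err
	}
	return out.Attributes, nil
}

func main() {
	svc := dynamodb.New(session.Must(session.NewSession()))
	_, _ = lockEntity(svc, "157009fd-65f3-4ec6-ac29-7f9f81cafa72", uuid.New().String())
}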

Failure Handling and Execution Plan

Any kind of error, including failing to acquire locks, should simply result in letting the Lambda fail (after a reasonable number of retries or time spent retrying). Once the Lambda invocation has failed, the Lambda service will, because the source is a Kinesis stream, automatically retry the invocation with the same data.

In order to prevent inconsistent entity states, before any data modification takes place, a full execution plan must be created. Each step of that plan represents a small idempotent action, e.g. an update in the DynamoDB. That plan is also stored in an S3 bucket before executing it. If the process fails, then on the next Lambda invocation the plan will be loaded and replayed right from the beginning. Because each step is idempotent and the involved entities are locked, it is safe to do so.

To ensure a consistent state during execution of the plan, the order of the steps is important. E.g. updating the S3 location must happen after the S3 object has been written, and adding an index should be one of the very last steps.
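A minimal sketch of how such an execution plan could be modelled; the step kinds are only examples of the idempotent actions described above, not TiloDB's actual step set:

package main

import "fmt"

// Step is one idempotent action of an execution plan.
type Step struct {
	Kind   string            `json:"kind"`
	Params map[string]string `json:"params"`
}

// Plan is created in full before any modification and stored in S3, so a
// failed invocation can replay it from the beginning.
type Plan struct {
	PlanID string `json:"planID"`
	Steps  []Step `json:"steps"`
}

// execute runs the steps in order; because every step is idempotent and the
// involved entities are locked, re-running a partially executed plan is safe.
func execute(p Plan, run func(Step) error) error {
	for i, s := range p.Steps {
		if err := run(s); err != nil {
			return fmt.Errorf("step %d (%s): %w", i, s.Kind, err)
		}
	}
	return nil
}

func main() {
	p := Plan{PlanID: "example", Steps: []Step{
		{Kind: "putS3Object"},
		{Kind: "updateEntityTable"},
		{Kind: "addIndex"}, // indexing happens last to keep the state consistent
	}}
	_ = execute(p, func(s Step) error { fmt.Println("executing", s.Kind); return nil })
}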

Matching and Search Rules

To keep things (relatively) simple in this description only two rules were used for both matching and searching. However, things start to get interesting when the rules for matching and the rules for searching are different.

Take your time and think about the consequences of, for example, defining a rule S1 that is similar to R1 but does not include the first name, and then using S1 exclusively in the search.


Alternatives

As a credit bureau, our job is to sell data, not to build new types of databases. Nevertheless, we were forced into creating TiloDB because the existing alternatives for entity resolution provided entirely unsatisfactory performance. (N.B. for an interesting review of the state of entity resolution technology, see this blog post by Adrian Colyer, Venture Partner at Accel.)

Offline Entity Matching

The traditional approach to entity matching is to precalculate all matches up-front. For small numbers of data sets this works quite well. Larger numbers of data sets require some kind of blocking approach, similar to the index keys used here. However, even the slightest change requires a complete recalculation of all matches.

Relational Database

Using a recursive search in a relational database would also be possible. But there is no optimization for non-identical duplicates and jumping from one data set to the next to the next to the next ... takes a very long time. Nothing real-time about that!

Graph Database

Graph databases are great - but not for this use case. They are great at answering questions such as "who are the friends of my friends" or "how is A connected with B". But they fail when asked "what are all the nodes and edges that belong to a certain sub-graph". The reason is the same as with relational databases: they need to jump from node to node. Besides, the handling of the non-identical duplicates must happen outside of a graph database.


The original Tilo

If you have made it this far, then congratulations! Your reward is a photo of the eponymous Tilo, who is currently aged 3 and lives in Brandenburg, Germany. 

Tilo, aged 3

Again, if you want to be notified when we release TiloDB as OSS, please enter your email address below. 


*less fun fact: we just discovered that there is in fact already a database company (albeit a completely different type of technology) with a name that differs from TiloDB by only one letter(!), so I guess we will be changing our name soon.
