Data Mining

At M2009 (SAS's data mining conference), I was approached with the question of scoring association rules for customers. This is not a topic I have thought about very much. More typically, association rules are used qualitatively or to understand products. I hadn't thought about assigning the "best" rule (or rules) back to customers.

As a reminder, association rules provide information about items that are purchased at the same time. For example, we might find that marshmallows and chocolate bars imply graham crackers. The "marshmallows" and "chocolate bars" are the left hand side of the rule (LHS) and the graham crackers is the right hand side (RHS). The presumption is that when graham crackers are missing from a shopper's basket, then they should be there.

Most data mining software, such as SAS Enterprise Miner, SQL Server Data Mining, and SPSS Clementine, can be used to generate association rules. I prefer to calculate the rules myself using database technology, using code similar to that in Data Analysis Using SQL and Excel.

However, data mining tools do not provide the ability to score association rules for individual customers. Neither is this is a topic that I discuss in my book. My goal here is to discuss scoring rules in databases. This is because scoring is computationally expensive. Because databases can take advantage of indexing and parallel processing, they offer scope to make the score more efficient.

Hmmm, what does scoring association rules even mean? Scoring is the process of finding the best rule that a customer matches, either for a single RHS or for all possible RHSs. In the former case, the result is one rule. In the latter, it is an array of rules, for each possible RHS.

An association rule is traditionally defined by three metrics: support, confidence, and lift (as well as a fourth, the chi-square metric, which I prefer). For the purposes of this discussion, the best rule is the one with the highest confidence.

The simplistic way of doing such scoring is by considering each rule for each customer, to determine which rules apply to each customer. From the set that do apply, do some work to find the best one.

Imagine that we have a table, rules, with the following columns:

The number of LHS items (we assume there is 1 RHS item);
The RHS item.
The LHS items, as a string: "item1;item2;..."

There is another table, custitem, containing each customer and each item as a separate row.

The following query find all matching rules for each customer in the innermost subquery, by counting the number of items matched on the left hand side. The outer query then finds the rule (for each RHS) that has the maximum confidence, using SQL window functions.

SELECT cr.*
FROM (SELECT customerid, r.rhs, r.ruleid,
. ...........(MAX(r.confidence) OVER (PARTITION BY customerid, rhs)
.............) as maxconfidence
......FROM (SELECT ci.customerid, r.rhs, r.ruleid,
...................COUNT(*) as nummatches,
............FROM custitem ci CROSS JOIN
.................rules r
............WHERE CHARINDEX(ci.item||';', r.lhs) > 0
............GROUP BY ci.customerid, r.rhs, r.ruleid
............HAVING COUNT(*) = MAX(r.numlhs)
...........) matchrules JOIN
...........rules r
...........ON matchrules.ruleid = rules.ruleid
......) cr
WHERE confidence = maxconfidence

This query is expensive, as you might guess from the use of CROSS JOIN. And, its performance gets longer particularly as the number of rules gets larger (and presumably the number of customers is larger still).

It is possible to make it more efficient, by doing tricks, such as:

If there are a few number of items, then the LHS could be encoded using bits. This eliminates the need for string matching.
The rules can be pruned, so only the rules with the highest confidence are kept.

And, although this cannot be done in SQL, the rules could be ordered by confidence (for each RHS) from highest to lowest. The first match would then stop the search.

An alternative method requires storing the rules in two tables. The first is rules, containing descriptive information about each rule, such as:

ruleid;
rhs; and,
numlhs.

The second is ruleitem, which contains each item in the rules. Incidentally, this is more in keeping with the spirit of normalization in relational databases.

The subquery for the scoring now changes to a join. This is useful, because it means that we can use database mechanisms -- such as indexing and table partitioning -- to speed it up.

SELECT cr.*
FROM (SELECT customerid, r.rhs, r.ruleid,
.............(MAX(r.confidence) OVER (PARTITION BY customerid, rhs)
.............) as maxconfidence
......FROM (SELECT ci.customerid, r.rhs, r.ruleid,
...................COUNT(*) as nummatches, MAX(numlhs) as numlhs
............FROM custitem ci JOIN
.................ruleitems ri
.................ON ci.item = ri.item JOIN
.................rule r
.................ON ri.ruleid = ri.ruleid
............GROUP BY ci.customerid, r.rhs, r.ruleid
............HAVING COUNT(*) = MAX(r.numlhs)
...........) matchrules JOIN
...........rules r
...........ON matchrules.ruleid = rules.ruleid
......) cr
WHERE confidence = maxconfidence

Of course, such an approach makes a difference only when you need to score many customers and you have many rules. This same approach can be used for looking at a single product in the RHS or at several at one time. Of course, this would require summarizing the multiple products at the customer level in order to append the desired information on the customer record.

Data Mining

Saturday, January 8, 2011

Scoring Association Rules - You Must Know