Best approach to handle customer-specific data attributes in a data model for packaged software

Chris Saxon

Thanks for the question, Rodney.

Asked: April 19, 2020 - 1:13 pm UTC

Last updated: April 27, 2020 - 4:12 pm UTC

Version: Oracle 19c

Viewed 1000+ times

You Asked

I need to create a module to allow for entering one or more quality attributes of delivered products. This is for packaged software and so I need to be able to configure this module to work for a variety of customers. Here are the general requirements:

Samples of a delivered product are taken to be analyzed for various qualities. Sample evaluations generally comprise more than one value and are generally numerical (a count of some defect, the percentage of sample weight that exhibits some attribute, etc.). On occasion alphabetic values would be handy, but usually these can be converted to some numerical representation of the value. For example, overall color (light, medium, dark) can be stored as 1, 2, 3.

The quality attributes that are tracked will differ by the type of item received. Also, as mentioned above, this is packaged software. I have to handle this both in a licensed-product situation, in which the database serves a single customer, and in a SaaS solution (with fewer than 100 customers in the same database).

This is a classic Entity-Attribute-Value (EAV) scenario. I know that Tom despises that pattern. I do as well. I believe the following accurately summarizes his recommendation:

1. First and foremost, design the tables exactly as needed. For example, if I need a quality attribute called moisture_pct then that is what I should call the column.

2. If the first recommendation isn't feasible, have a wide table of generic columns with correctly defined data types, such as quality_num1 NUMBER, quality_num2 NUMBER, quality_char1 VARCHAR2(200), etc.

Is this still the recommended guidance or have newer patterns emerged or new database features available that I should consider using?

and Chris said...

Best is subjective.

So instead of asking "which is best", ask "what are the trade-offs for each solution?"

Here's my take on the trade-offs for these, plus some other options:

1. A specific column for every attribute.

Upsides: Defining the correct data types and constraints gives great quality data.
You can see all the attributes by querying the data dictionary, which makes the schema easy to understand.
If you need reports aggregating values (e.g. what's the average moisture_pct?), this is easy to do with SQL.

Downsides: You may get requests for new columns faster than your development team can add them. If values come from automated imports, new values could appear on the fly, leading to lost data. This is likely to cause customer dissatisfaction; worst case, they move to a different product.
Can lead to a large number of nullable columns that are only relevant to a few customers. With many customers that each want a few unique attributes, you can hit the 1,000-column limit for a table in Oracle Database.
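
For illustration, here's a minimal sketch of this design (the table and column names are invented for the example):

create table sample_results (
  sample_id     integer primary key,
  product_id    integer not null,
  moisture_pct  number check ( moisture_pct between 0 and 100 ),
  defect_count  integer check ( defect_count >= 0 ),
  color_code    integer check ( color_code in ( 1, 2, 3 ) ) -- light, medium, dark
);

Reports are then plain SQL, e.g. the average moisture per product:

select product_id, avg ( moisture_pct ) avg_moisture
from   sample_results
group  by product_id;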

2. Use generic "data type-specific" columns (number_1, number_2, date_1, date_2, etc.)

Upsides: You can support a large number of custom attributes with little or no development.

Downsides: Some customers may quickly hit the limit for the number of attributes of a specific data type, leading to more development effort.
It's hard, if not impossible, to understand the schema just by querying the data dictionary. You need a table mapping attribute names to columns.
The meaning of a value in "NUMBER_1" changes depending on the customer. This makes debugging significantly harder.
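
A sketch of what this might look like (names are illustrative), including the mapping table you'd need to decode the generic columns:

create table sample_results (
  sample_id  integer primary key,
  number_1   number,
  number_2   number,
  date_1     date,
  varchar_1  varchar2(200)
);

-- Without this mapping, nobody can tell what number_1 means for a given customer
create table attribute_mappings (
  customer_id     integer,
  column_name     varchar2(30),
  attribute_name  varchar2(100)
);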

3. "Full" EAV

Upsides: You can support an unlimited number of custom attributes with little or no development.

Downsides: The queries to reconstruct a row are complex.
It's tough to implement constraints, so data quality is likely to be poor.
Analytic queries may be practically impossible to write and run very slowly (with limited tuning options).
It's easily abused to support other aspects of the application.
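
To see why reconstruction is painful, here's a sketch of a classic three-table EAV layout (all names invented for the example) and the join-plus-pivot needed to rebuild a single logical row:

create table entities ( entity_id integer primary key );
create table attributes (
  attribute_id    integer primary key,
  attribute_name  varchar2(100)
);
create table entity_values (
  entity_id        integer references entities,
  attribute_id     integer references attributes,
  attribute_value  varchar2(100)
);

-- One MAX/CASE pair per attribute you want back as a column
select e.entity_id,
       max ( case when a.attribute_name = 'moisture_pct' then v.attribute_value end ) moisture_pct,
       max ( case when a.attribute_name = 'defect_count' then v.attribute_value end ) defect_count
from   entities e
join   entity_values v on v.entity_id = e.entity_id
join   attributes a on a.attribute_id = v.attribute_id
group  by e.entity_id;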

4. "Partial" EAV

Have a CUSTOM_ATTRIBUTES table:

create table custom_attributes (
  product_id      integer,
  attribute_name  varchar2(100),
  attribute_value varchar2(100)
);


Optionally you could make custom_date_attributes, custom_number_attributes, etc.

Upsides: You can support an unlimited number of custom attributes with little or no development.
You can use FKs to ensure the attributes only apply to the correct product/sample (see the sketch after the downsides).
Custom attributes are limited to the product domain, limiting the chance this will be repurposed for other features (unlike full EAV).

Downsides: The queries to reconstruct a row are complex.
Reporting queries are hard to write and tune.
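
One way to get the FK protection mentioned in the upsides (a sketch: the products table is assumed to already exist in your schema, and attribute_definitions is an invented lookup table restricting the legal attribute names):

create table attribute_definitions (
  attribute_name  varchar2(100) primary key
);

alter table custom_attributes add constraint ca_product_fk
  foreign key ( product_id ) references products;

alter table custom_attributes add constraint ca_attribute_fk
  foreign key ( attribute_name ) references attribute_definitions;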

5. Store optional attributes as JSON

Upsides: You can support an unlimited number of custom attributes with little or no development.
JSON is a popular format with lots of support from tools.

Downsides: You can store new attributes that the application is unaware of, meaning you can't manage these in the application.
What's stored in each document can change row-to-row, making it hard to understand the schema. The JSON Data Guide can help with this and the previous issue by exposing the attributes as virtual columns.
Can be abused to store "anything" in the database.

You could make similar arguments for XML instead of JSON.
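
A rough sketch of the JSON approach (column and attribute names are illustrative):

create table sample_results (
  sample_id    integer primary key,
  product_id   integer not null,
  extra_attrs  varchar2(4000) check ( extra_attrs is json )
);

-- Pull attributes back out relationally with JSON_TABLE
select s.sample_id, j.moisture_pct
from   sample_results s,
       json_table (
         s.extra_attrs, '$'
         columns ( moisture_pct number path '$.moisture_pct' )
       ) j;

If you create a JSON search index with a data guide on extra_attrs, the data guide can then expose attributes like moisture_pct as virtual columns for you.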

I'm sure there are more things you could list in the pros and cons for each option. The point is you make this list, then assess which option has the most upside for the least downside for you. For example, most of the downsides for the "full relational" tables relate to speed of delivery. If new attributes pop up daily, this is a problem. But if something new comes along only once a year, this is practically a non-issue.


Comments

Nice general answer! Suggested add...

Stew Ashton, April 23, 2020 - 8:23 am UTC

Good list! Perhaps it would be useful to mention the upsides vs. downsides of changing the data. Solution 1 is the easiest for updates. Solution 5 (JSON) requires updating the entire JSON string whenever any attribute changes, with all the related redo/undo downsides.

(The redo/undo downsides should be mitigated in version 20c with the new JSON_TRANSFORM function.)
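
For instance, an in-place, piecewise update with JSON_TRANSFORM might look like this (a sketch; it assumes the JSON sits in a column such as extra_attrs):

update sample_results
set    extra_attrs = json_transform ( extra_attrs, set '$.moisture_pct' = 11.5 )
where  sample_id = 1;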

Great summary

Rodney, April 23, 2020 - 10:55 am UTC

Thanks Chris for the great summary of options. I was unfamiliar with the JSON Data Guide feature, and so asking the open-ended question accomplished my goal, which was to uncover additional options unknown to me. My particular use case is probably better served with either the Partial EAV or the JSON approach. I will likely need to provide the ability to configure validations on the data as well, such as "moisture_pct cannot be greater than 100". With the EAV model I will have that attribute defined in a table somewhere, and so that would be the starting point to add some sort of infrastructure to provide configured validation. Since I am not familiar with the JSON Data Guide, can you tell me if that would similarly be easy enough to implement? Presumably that will simply be another attribute of the "column" I am defining in the document. Also, how would you rate their performance in querying when comparing the two approaches? This is obviously the big downside of both approaches. I have experience with the Partial EAV approach and have been able to overcome performance issues with it (enough anyway) using the WITH clause in views.
I guess, based on what you know of my requirements, do you have a strong opinion one way or another in the Partial EAV vs. JSON approach?
Connor McDonald
April 24, 2020 - 2:45 am UTC

(Connor here)

Depends very much on the tooling. I know that statement upsets the purists who want data modelling kept separate from the apps, but physical design is where reality and data collide.

If you have tools/platforms which handle JSON well (Javascript being an obvious example), then you'll probably find most benefit with JSON.

If you have some older tools (think Forms/Reports, etc.) where JSON is a mystery, meaning every single query needs to unpack the JSON, then maybe EAV works better.

It is always a cost/benefit decision here.

A reader, April 27, 2020 - 11:37 am UTC

From my experience, go with #2

The way we implemented it:

Have a table with number_1, number_2, date_1, date_2, etc.

Then create a view on top of that table and give number_1, number_2 meaningful names, e.g.

create view aaa as
select t.number_1 as moisture_pct, t.date_1 as measure_date, ....
from t;

This way you can ensure that numbers are stored in number columns, dates in date columns, and you can create indexes on individual columns.

It's not as flexible as "generic" attributes, but it has much, much more data integrity (compared to #4) and you can tune the performance (which, again, won't be the case for option #4).


Chris Saxon
April 27, 2020 - 4:12 pm UTC

The issue with this is different customers may store different attributes in each column!

If you've got the development cycles to define a view, you may as well create columns with the correct names in the first place.

While it's tougher, there are tuning options for option #4; materialized views are one of the best candidates.
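
For example, a materialized view that pivots the custom attributes into real columns (a sketch; the attribute names are invented and you'd pick a refresh strategy to suit):

create materialized view custom_attributes_mv
build immediate
refresh complete on demand
as
select product_id,
       max ( case when attribute_name = 'moisture_pct' then attribute_value end ) moisture_pct,
       max ( case when attribute_name = 'color' then attribute_value end ) color
from   custom_attributes
group  by product_id;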
