I need to create a module to allow for entering one or more quality attributes of delivered products. This is for packaged software and so I need to be able to configure this module to work for a variety of customers. Here are the general requirements:
Samples of a delivered product are taken to be analyzed for various qualities. Sample evaluations are generally more than one value and are generally numerical (count of some defect, percentage of sample weight that exhibits some attribute, etc.). Although on occasion alpha values would be handy, but usually these can be converted to some numerical representation of the value. For example, overall color (light, medium, dark can be 1, 2, 3).
The quality attributes that are tracked will be different for the type of item received. Also, as mentioned above, this is packaged software. I have to handle this in both a licensed product situation in which the database is a single customer and in a SaaS solution (with less than 100 customers in the same database).
This is a classic Entity Attribute Value pair scenario. I know that Tom despises that pattern. I do as well. I believe the following accurately summarizes his recommendation for this:
1. first and foremost, design the tables exactly as needed. For example, if I need a quality attribute called moisture_pct then that is what I should call the column.
2. if not choosing the first recommendation then have a wide table of generic columns, with correctly defined data types, such as quality_num1 NUMBER, quality_num2 NUMBER, quality_char1 VARCHAR2(200), etc.
Is this still the recommended guidance or have newer patterns emerged or new database features available that I should consider using?
Best is subjective.
So instead of asking "which is best", ask "what are the trade-offs for each solution?"
Here's my take on the trade-offs for these, plus some other options:
1. A specific column for every attribute.Upsides: Defining the correct data types and constraints gives great quality data.
You can see all the attributes by querying the data dictionary, this makes the schema easy to understand.
If you need reports aggregating values (e.g. what's the average moisture_pct?) this is easy to do with SQL.
Downsides: you may get requests for new columns faster than your development team can add them. If values come from automated imports, new values could appear on-the-fly, leading to lost data. This is likely to cause customer dissatisfaction; worst case they move to a different product.
Can lead to a large number of nullable columns that are only relevant to a few customers. With many customers that each want a few unique attributes, you can hit the 1,000 column limit for a table in Oracle Database.
2. Use generic "data type-specific" columns (number_1, number_2, date_1, date_2, etc.)Upsides: You can support a large number of custom attributes with little or no development.
Downsides: Some customers may quickly hit the limit for the number of attributes for a specific data type. Leading to more development effort.
It's hard to impossible to understand the schema just by querying the data dictionary. You need a table mapping attribute names to columns.
The meaning of a value in "NUMBER_1" changes depending on the customer. This makes debugging significantly harder.
3. "Full" EAVUpsides: You can support an unlimited number of custom attributes with little or no development.
Downsides: The queries to reconstruct a row are complex.
It's tough to implement constraints, data quality is likely to be poor.
Analytic queries may be practically impossible to write and run very slowly (with limited tuning options).
Easily abused to support other aspects of the application
4. "Partial" EAVHave a CUSTOM_ATTRIBUTES table:
create table custom_attributes (
product_id integer,
attribute_name varchar2(100),
attribute_value varchar2(100)
);
Optionally you could make custom_date_attributes, custom_number_attributes, etc.
Upsides: You can support an unlimited number of custom attributes with little or no development.
You can use FKs to ensure the attributes only apply to the correct product/sample.
Custom attributes are limited to the product domain, limiting the chance this will be repurposed for other features (unlike full EAV).
Downsides: The queries to reconstruct a row are complex.
Reporting queries are hard to write and tune
5. Store optional attributes as JSONUpsides: You can support an unlimited number of custom attributes with little or no development.
JSON is a popular protocol with lots of support from tools.
Downsides: You can store new attributes that the application is unaware of, meaning you can't manage these in the application.
What's stored in each document can change row-to-row, making it hard to understand the schema. The JSON Data Guide can help with this and the previous issue by exposing the attributes as virtual columns
Can be abused to store "anything" in the database.
You could make similar arguments for XML instead of JSON.
I'm sure there are more things you could list in the pros and cons for each option. The point is you make this list, then assess which option has the most upside for the least downside for you. For example, most of the downsides for the "full relational" tables relate to speed of delivery. If new attributes pop up daily, this is a problem. But if something new comes along only once a year, this is practically a non-issue.