10 Common Mistakes Java Developers Make when Writing SQL

Java developers mix object-oriented thinking with imperative thinking, depending on their levels of:
- Skill (anyone can code imperatively)
- Dogma (some use the “Pattern-Pattern”, i.e. the pattern of applying patterns everywhere and giving them names)
- Mood (true OO is more clumsy to write than imperative code. At first)
But
 when Java developers write SQL, everything changes. SQL is a 
declarative language that has nothing to do with either object-oriented 
or imperative thinking. It is very easy to express a query in SQL. It is
 not so easy to express it optimally or correctly. Not only do 
developers need to re-think their programming paradigm, they also need 
to think in terms of set theory.
Here are common mistakes that a Java developer makes when writing SQL (in no particular order):
1. Forgetting about NULL
Misunderstanding
 NULL is probably the biggest mistake a Java developer can make when 
writing SQL. This is also (but not exclusively) due to the fact that 
NULL is also called UNKNOWN. If it were only called UNKNOWN, it would be
 easier to understand. Another reason is that JDBC maps SQL NULL to Java
 null when fetching data or when binding variables. This may lead to 
thinking that NULL = NULL (SQL) would behave the same way as null == 
null (Java)
One of the crazier examples of misunderstanding NULL is when NULL predicates are used with row value expressions.
Another, subtle problem appears when misunderstanding the meaning of NULL in NOT IN anti-joins.
The Cure:
Train yourself. There’s nothing but explicitly thinking about NULL, every time you write SQL:
Is this predicate correct with respect to NULL?
Does NULL affect the result of this function?
2. Processing data in Java memory
Few
 Java developers know SQL very well. The occasional JOIN, the odd UNION,
 fine. But window functions? Grouping sets? A lot of Java developers 
load SQL data into memory, transform the data into some appropriate 
collection type, execute nasty maths on that collection with verbose 
loop structures (at least, before Java 8′s Collection improvements).
But
 some SQL databases support advanced (and SQL standard!) OLAP features 
that tend to perform a lot better and are much easier to write. A 
(non-standard) example is Oracle’s awesome MODEL clause. Just let the 
database do the processing and fetch only the results into Java memory. 
Because after all some very smart guys have optimised these expensive 
products. So in fact, by moving OLAP to the database, you gain two 
things:
Simplicity. It’s probably easier to write correctly in SQL than in Java
Performance.
 The database will probably be faster than your algorithm. And more 
importantly, you don’t have to transmit millions of records over the 
wire.
The Cure:
Every time you 
implement a data-centric algorithm in Java, ask yourself: Is there a way
 to let the database perform that work for me?
3. Using UNION instead of UNION ALL
It’s
 a shame that UNION ALL needs an extra keyword compared to UNION. It 
would be much better if the SQL standard had been defined to support:
UNION (allowing duplicates)
UNION DISTINCT (removing duplicates)
Not
 only is the removal of duplicates rarely needed (or sometimes even 
wrong), it is also quite slow for large result sets with many columns, 
as the two subselects need to be ordered, and each tuple needs to be 
compared with its subsequent tuple.
Note that 
even if the SQL standard specifies INTERSECT ALL and EXCEPT ALL, hardly 
any database implements these less useful set operations.
The Cure:
Every time you write a UNION, think if you actually wanted to write UNION ALL.
4. Using JDBC Paging to page large results
Most
 databases support some way of paging ordered results through LIMIT .. 
OFFSET, TOP .. START AT, OFFSET .. FETCH clauses. In the absence of 
support for these clauses, there is still the possibility for ROWNUM 
(Oracle) or ROW_NUMBER() OVER() filtering (DB2, SQL Server 2008 and 
less), which is much faster than paging in memory. This is specifically 
true for large offsets!
The Cure:
Just use those clauses, or a tool (such as jOOQ) that can simulate those clauses for you.
5. Joining data in Java memory
From
 early days of SQL, some developers still have an uneasy feeling when 
expressing JOINs in their SQL. There is an inherent fear of JOIN being 
slow. This can be true if a cost-based optimiser chooses to perform a 
nested loop, possibly loading complete tables into database memory, 
before creating a joined table source. But that happens rarely. With 
appropriate predicates, constraints and indexes, MERGE JOIN and HASH 
JOIN operations are extremely fast. It’s all about the correct metadata 
(I cannot cite Tom Kyte often enough for this). Nonetheless, there are 
probably still quite a few Java developers who will load two tables from
 separate queries into maps and join them in Java memory in one way or 
another.
The Cure:
If 
you’re selecting from various tables in various steps, think again to 
see if you cannot express your query in a single statement.
6. Using DISTINCT or UNION to remove duplicates from an accidental cartesian product
With
 heavy joining, one can loose track of all the relations that are 
playing a role in a SQL statement. Specifically, if multi-column foreign
 key relationships are involved, it is possible to forget to add the 
relevant predicates in JOIN .. ON clauses. This might result in 
duplicate records, but maybe only in exceptional cases. Some developers 
may then choose to use DISTINCT to remove those duplicates again. This 
is wrong in three ways:
It (may) solve the symptoms but not the problem. It may as well not solve the symptoms in edge-cases.
It is slow for large result sets with many columns. DISTINCT performs an ORDER BY operation to remove duplicates.
It is slow for large cartesian products, which will still load lots of data into memory
The Cure:
As
 a rule of thumb, when you get unwanted duplicates, always review your 
JOIN predicates. There’s probably a subtle cartesian product in there 
somewhere.
7. Not using the MERGE statement
This
 isn’t really a mistake, but probably some lack of knowledge or some 
fear towards the powerful MERGE statement. Some databases know other 
forms of UPSERT statements, e.g. MySQL’s ON DUPLICATE KEY UPDATE clause.
 But MERGE is really so powerful, most importantly in databases that 
heavily extend the SQL standard, such as SQL Server.
The Cure:
If
 you’re UPSERTING by chaining INSERT and UPDATE or by chaining SELECT ..
 FOR UPDATE and then INSERT or UPDATE, think again. Apart from risking 
race conditions, you might be able to express a simpler MERGE statement.
8. Using aggregate functions instead of window functions
Before
 the introduction of window functions, the only means to aggregate data 
in SQL was by using a GROUP BY clause along with aggregate functions in 
the projection. This works well in many cases, and if aggregation data 
needed to be enriched with regular data, the grouped query can be pushed
 down into a joined subquery.
But SQL:2003 
defined window functions, which are implemented by many popular database
 vendors. Window functions can aggregate data on result sets that are 
not grouped. In fact, each window function supports its own, independent
 PARTITION BY clause, which is an awesome tool for reporting.
Using window functions will:
Lead to more readable SQL (less dedicated GROUP BY clauses in subqueries)
Improve performance, as a RDBMS is likely to optimise window functions more easily
The Cure:
When you write a GROUP BY clause in a subquery, think again if this cannot be done with a window function.
9. Using in-memory sorting for sort indirections
The
 SQL ORDER BY clause supports many types of expressions, including CASE 
statements, which can be very useful for sort indirections. You should 
probably never sort data in Java memory because you think that
SQL sorting is too slow
SQL sorting cannot do it
The Cure:
If
 you sort any SQL data in memory, think again if you cannot push sorting
 into your database. This goes along well with pushing paging into the 
database.
10. Inserting lots of records one by one
JDBC
 knows batching, and you should use it. Do not INSERT thousands of 
records one by one, re-creating a new PreparedStatement every time. If 
all of your records go to the same table, create a batch INSERT 
statement with a single SQL statement and multiple bind value sets. 
Depending on your database and database configuration, you may need to 
commit after a certain amount of inserted records, in order to keep the 
UNDO log slim.
The Cure:
 
No comments:
Post a Comment