Today I attended a class on Building Scalable, High Performance Applications put on by Percona, a bunch of guys who wrote MySQL and started their own consulting firm. There was only two people in the class which was quite surprising as these guys are the best in their space. But here are my notes from the class, for what it is worth. Some of the items are random tidbits that came up. The guy knew PHP so some of the stuff is about PHP.
Response, how long it takes, and throughput, how many users you can serve.
If a feature isn’t core to the user’s experience, it ought to be in another database. Logs, statistics, or other non-customer facing data shouldn’t become a bottleneck to the application’s functions.
It’s good for all the stakeholders to agree on what a reasonable response time is.
Optimizing for throughput can hurt response, optimizing for response can hurt throughput.
Passing non-urgent requests to an asynchronous queue can help throughput and response time. Gearman or ActiveMQ can be used for async tasks that need to be inserted.
Tuning your slowest queries isn’t as helpful as looking at the full stack of what is happening in an app and finding out the bottleneck of regular user requests. Saving on one query could make the overall request take longer because of additional tasks that need to be done. Alternatively, eliminating a bunch of the fast queries might help the application performance, even though they’re fast. Setting your long-query time to 0 for 5 minutes, an hour, whatever you can afford to do, then send the it through mk-query-digest will give you the query that took the longest combined. That will help to find fast queries that take a lot of time because of how often they are called. You can also use tcp dump on port 3306 piped to mk-query-digest as well.
You can add performance data gathering in your live app by grabbing the data randomly (e.g. if (rand(1, 100) == 1) capturePerformance();)
Sphinx is like lucene for PHP.
Looking at average time doesn’t help as much as looking at the 95th percentile, because some pages might be really fast, others might be really slow. (e.g. 95% served within 300ms, 99% within 1200ms)
Make it harder for users to do things that are expensive. Don’t discount the outlier expensive requests, denial of service, users may surprise you. It’s easier to scale task when they are about the same shape.
Cacti and Munin can graph activity over time with plugins for MySQL.
Cross database joins are really no different than cross table joins in a single MySQL instance.
Make your code flexible, put SQL into a library. Don’t hard-code database names, IPs, hosts, reads may need to go against one database, writes against another. Perhaps logs or other functions might hit even another database.
Sharding is for write-heavy applications. A cluster setup is best when grabbing rows by id, but geting a range of data causes a lot of network IO, but they’re good for write-heavy too.
Exact numbers aren’t always needed. You can guestimate or round the number. Grab the number once a day and cache that, use a counter with memcached to batch the updates, etc.
Don’t overengineer, don’t add complexity if you don’t need to. 1 database on 1 machine is simple, master slave less, sharding even less.
First, use caching (memcached) then, if needed use replication. Finally, as a last option, go to sharding.
Grouping writes in a transaction helps a lot. Using a queue to async the writes will help.
InnoDB groups rows by their id in page files. Thus, auto-increment ids are more optimized under innodb than random ids.
Percona’s blog is mysqlperformanceblog.org and looks to have some great articles including EC2 performance and more.
If you’re sharding, there’s several methods. You could shard the same data in different shards that is being accessed for different uses (search db vs data storage db).
Sysbench provides a framework for benchmarking mysql, but replaying the user’s data is the best strategy. Mysqlslap creates random data for testing data load.
When upgrading, upgrade the slaves first, then the master.
In applications, mysql connections should be short lived. You should fetch all the data as early in your request as you can, then close the connection. Stored Procedures can be more performant because they lower the round trips to the database from your application code.
Persistent connections or connection pools can often be bad. JDBC is pretty good, but others (PHP) are not. Creating a connection to MySQL is really cheap.
Dividing reads amongst the slaves in a smart manner allows for DB caching.
Building summary tables for users when they log in to cache what they may be likely to access can increase performance.
Master-master setup (mysql-master-master, MMM) can work well when a database goes down.
When we know MySQL is the issue, use EXPLAIN. Bookmark, read, and use http://dev.mysql.com/doc/refman/5.1/en/using-explain.html.
Varchar is stored as the number of characters + 1, but when aggregating it in memory during a select, MySQL can not use variable width fields, so using VARCHAR(255) for all your fields causes much higher memory usage in queries than might be necessary.
\G at the end of your queries on the command line prints out the rows in blocks and is more readable (if you have one or just a few rows returned).
MySQL is pretty smart about its data distribution. Run ANALYZE TABLE table_name; to reset the usage statistics.
Indexes are stored in a balanced tree.
An index on field A, B, and C takes the three values and concatenates them. It will use A first, then B, then C. If any of the WHERE clause is a range query (<, >, BETWEEN, etc) MySQL will not use the fields after that. In other words, if A was only used in range queries, it wouldn’t make any sense to add B and C to the index. Or you might put B and C before A. The first field in your index should be the most limiting in the result set. Knowing your data will help you to create and use the best indexes.
SQL Tuning by Dan Tow is a good book to understand how things work under the hood.
When JOINing tables if you think MySQL isn’t doing it optimally you can use a STRAIGHT_JOIN and it will filter the result set in the order you have the tables, rather than it’s own idea.
Subqueries are horrible on performance with MySQL’s optimizer. Use a join where you can. SELECT field FROM table procedure analyse(); will help to see what your data looks like with min, max, and average sizes.
A benchmark function helps test speeds: SELECT benchmark(1000000, crc32(“test”));
Since MySQL uses a nested loop join, don’t be shy about denormalizing your data. Optimizer decision making is all about tradeoffs.
`mysqladmin extended status -i 2 -r` will show the global status every 2 seconds. -r will do relative.
SHOW SESSION STATUS will give the temp tables create for the session and a lot of other useful data that I don’t understand completely. Setting @@profiles = 1 will start storing profiles for the queries run. Then SHOW PROFILES; will show the profiles stored for each query, but you still have to figure out what to do with it. And that is for the whole system, not thread/connection specific.
Using IN() is faster than ranges (e.g. WHERE id IN(1, 2, 3, 4, 5, 6, 7, 8, 9, 10); is faster than WHERE id BETWEEN 1 AND 10; supposedly.) Because the equalifier is more mature than the range.
Day one advice: keep it simple, normalize with the understanding you may denormalize later as needed. Use unicode where needed, not necessarily everywhere. Always have a primary key, keep it short, try and make it your primary access method. Innodb is safest, use it unless you have a specific reason to use MyISAM.
If you have a column that gets written to (last_login_date on a user table) separate them out to another table. Keep your reads separate from your writes. “Hot column on a wide table”
Domas has a good article on why round-trips take time.
Mark Calahan wrote a nice post about using LIMIT with Mysql. Using LIMIT 100, 10 reads 110 rows, only returns 10. Grabbing the last id from the previous page and doing a WHERE id > last_id LIMIT 0, 10 is faster.
To optimize, don’t use triggers, foreign keys, stored procedures except for limiting the trips to the server.
Hope this helps.