Interesting from a licensing point of view: the GPL does not require you to redistribute changes that are 'internal' to your organization. You are not redistributing the program itself in this case, you're just letting someone access it.
So the way Werner Vogels explained it at the AWS Summit in NYC recently, Aurora is built from scratch and is MySQL compatible, but it is not a fork. He mentioned that they broke it down into components which are each built on pre-existing AWS services - for example, the storage layer is just DynamoDB, which I thought was really interesting.
There's a distinct possibility that it's not based on MySQL at all. From their prior discussions, this looks like a clean implementation that just uses its API, so even if MySQL were AGPLed, Amazon wouldn't need to release anything.
It's just a storage engine, albeit a very good one (I was on the beta); storage engines act as a plugin to the mysql/mariadb server layer, and are typically just a separate binary.
Can't you get around the AGPL by just making it an internal service on your network? Basically it looks like this.
Internet --> Your API service --> AGPL program
Your API service uses the AGPL program to implement its logic. Since the FSF holds that APIs cannot be copyrighted, Your API service implements the same API as the AGPL program and uses a network connection to the AGPL program running on another computer to implement it. Since the Internet user never connects to the AGPL program, they are not entitled to the source code. You are only required to release the source code to the users interacting with the AGPL program over the network. You therefore release the source code to the implementers of Your API service and call it a day.
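A minimal sketch of that arrangement, assuming a plain HTTP pass-through (the backend address and names are hypothetical, and whether this actually satisfies the AGPL is the legal question being debated):

```python
# Thin public-facing service that re-implements the AGPL program's API
# and forwards each call to an internal-only instance. The public user
# only ever talks to this proxy; the AGPL program is reached over the
# internal network, so (on this theory) only this proxy's operators
# interact with it "over a network".
from urllib.parse import urljoin
from urllib.request import urlopen

AGPL_BACKEND = "http://10.0.0.5:8080/"  # internal host running the AGPL program

def internal_url(path: str) -> str:
    """Map a public API path onto the internal AGPL service."""
    return urljoin(AGPL_BACKEND, path.lstrip("/"))

def handle_request(path: str) -> bytes:
    # Forward the call and relay the response unchanged.
    with urlopen(internal_url(path)) as resp:
        return resp.read()
```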
Adding another layer of indirection solves a lot of problems in computer science, even GPL and AGPL
However, that only works if there really is an API service in the middle making your AGPL program internal. If you are exposing an AGPL program (let's say a database) as-a-service, you cannot have an API service in between (unless you re-implemented its entire API yourself), so you'd be exposing the AGPL database directly, in which case you are bound by the license with respect to those users.
Thanks for posting this; we've had some arguments with our investors and their lawyers recently about what the GPL and AGPL mean (specifically for Golang, in our case) and how they relate to SaaS. One case in point: we have a cronspec parser package licensed AGPL (what the hell!?) which doesn't even include a web server or listener code of any kind… so are we in violation by using it to parse cronspecs that our customers have entered into our SaaS API?!
Do you have any citation, references, or case-law I could use as a starting point to steal your arguments as my own?
The citation/case-law to use is those that define the terms distribution, adaptation, and derivative work.
With GPL, anything which isn't client-side code should be fine, especially since legal advice from the GPL's authors has said that SaaS and the GPL do not put any additional requirements on the service provider.
The AGPL talks about using the work, and as I read it, the law says you have to be a lawful owner of a copy in order to be permitted to copy the work into RAM. To be a lawful owner of the copy, you have to be in compliance with the license. How you do so is up to you, but the perceived consensus seems to put that responsibility on the SaaS provider.
Since we are talking about a SaaS API, what counts as the program is sadly a grey area. For example, Linux-based operating systems commonly have a command-line API but aren't a single program. Ask yourself (or the lawyers) what a nontechnical layman would consider a single work and what they would consider multiple separate programs working in unison. I suspect it depends heavily on what the API does, how data flows, and how the internal source code is laid out.
> With GPL, anything which isn't client-side code should be fine, especially since legal advice from the GPL's authors has said that SaaS and the GPL do not put any additional requirements on the service provider.
Legal guidance from GPL authors is of somewhat limited utility: they aren't your lawyers, and for software where the FSF isn't the copyright holder as well as the license author, it doesn't even have the utility of being a documented representation from the copyright holder as to their intent with the license they were offering.
They also are employed by an entity with a vested interest in promoting the use of the GPL.
The AGPL is fundamentally flawed. Imagine if Google were subject to it; the idea of free software is to have the option to modify that which you run yourself. Without a million spindles, Google's main products are useless to you.
> Imagine if Google were subject to it; the idea of free software is to have the option to modify that which you run yourself. Without a million spindles, Google's main products are useless to you.
Without years of expertise, and tons of free time, you will not be able to modify the free software either. You can have someone do it on your behalf though, and just as well you could have had someone (Yahoo? Microsoft?) run Google software for you, had it been available as free software.
> Without years of expertise, and tons of free time, you will not be able to modify the free software either. You can have someone do it on your behalf though, and just as well you could have had someone (Yahoo? Microsoft?) run Google software for you, had it been available as free software.
I'd like to see something like this with regards to B corporations or non-profits. Something like Gmail, Gravatar, or the like that are services you can pay for, but they're not run for-profit. Companies can bid on the management of the service, maintenance of the underlying open source packages, and so forth.
Something between municipal fiber and Wikipedia, but for SaaS services.
That is the basic idea behind such things, though distributed protocols become complex when you have to consider split-brain, multiple concurrent failures, etc., all of which can occur in large-scale events.
Aurora leverages Amazon RDS Multi-AZ in this area. Their implementation has been battle-tested over many years and many many db instances.
They already have that in RDS. This is a database of their own design. They were pragmatic enough to use the query language that most devs already know and trust, but how they pull off reads, writes, and sharding at that scale must involve fundamentally different internals than a single MySQL instance.
It certainly seems to be a MySQL 5.6 fork (or at least a backend a la InnoDB or similar), as the docs explicitly call out MySQL 5.6 compatibility. I feel like I wouldn't say "simply" though; that makes it seem like some guys at Amazon checked out the MySQL source, deleted some stuff, and then released this.
Unlikely, because they already have a hosted MySQL 5.6 product in RDS. If anything, this is a MySQL 5.6 frontend + query processing engine with a new storage layer.
It would have been extremely difficult, because MySQL Community Edition is GPL and, as far as I know, Amazon does not allow internal teams to modify GPL code. They might have changed their policy, though...
AWS is a giant construction store like Home Depot, where you can find everything. Rackspace is the local home improvement shop, with good service, which was doing fine until Home Depot built its store nearby. Now people have less and less reason to visit it. Some folks like it for old times' sake, though. DigitalOcean is your buddy who works at "Hammer&Nails Manufacturing Inc."; he can get you nice hammers and nails pretty cheap if you ask him, but not much else.
What do you do with your analogy now that Rackspace offers official support for helping you with your AWS and Azure deployments? Are the general contractors who started the local hardware store now also working a side job at Home Depot? ;)
The local shop owner offers a consulting service: he goes with you to Home Depot and helps you find the right things for your project (hoping you'll go to his shop next time, after you see how good he is) :)
Except that Rackspace's revenue is still growing steadily year over year at a solid pace, so clearly not everyone agrees with the "less and less reasons to visit it".
I think they are growing much slower than the industry, and competition from AWS makes their margins razor thin. The latest quarterly report shows that revenue growth was only 1.6% over the previous quarter with a profit margin of 0.6%, and their shares are down more than 60% from their 2013 maximum (in a growing market). So, overall, it doesn't look very good.
DO is for people who want root access to a commodity VM at the lowest price possible.
Rackspace is for people who want root access to a commodity VM and are willing to pay more for support.
AWS is for people who are willing to ditch commodity VMs and replace them with orchestrated services. (They'll sell you a commodity VM via EC2, but if that's all you want there's not a lot of reason to go with them.)
Rackspace's problem is that this segmentation puts them in the awkward position of being the wrong choice for the extremely price-sensitive (who go with DO 'cause it's cheap, or with even cheaper alternatives like Dreamhost) and the wrong choice for the "money is no object" crowd (who go with AWS, because at scale working with orchestrated services is much easier than managing huge fleets of persistent VMs). Which ends up not leaving a whole lot of people for them to sell to.
You could just install a Galera cluster on DO and it'd function similarly, and tbh, that is possibly what they did.
Ignoring the benchmark BS [which is almost always BS] basically they are offering:
> With storage replicated both within and across three Availability Zones, along with an update model driven by quorum writes, Amazon Aurora is designed to deliver high performance and 99.99% availability while easily and efficiently scaling to up to 64 TB of storage.
Galera is quorum writes [basically], and can scale to ~6 nodes pretty effectively [more than 6 and you run into issues with write performance imho]. This would be a 2/2/2 setup across the 3 availability zones.
Galera's synchronous write process means you want everything near each other geographically, so 3 AZs in the same region would make perfect sense.
> Along the way, they verified that each Amazon Aurora instance is able to deliver on our performance target of up to 100,000 writes and 500,000 reads per second, along with a price to performance ratio that is 5 times better than previously available.
That is literally the read/write ratio you'll see on a 6-node Galera cluster. Reads scale per node [reads are local], but writes don't [every node has to perform the write, plus the replication overhead].
So yeah, you can set something similar up with any provider with 3 geographically near DCs.
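For reference, a roll-your-own version of the 2/2/2 layout described above would look something like this Galera my.cnf fragment (hostnames, cluster name, and the provider path are placeholders, not from the thread):

```ini
# Hypothetical config for one node of a six-node Galera cluster
# spread 2/2/2 across three nearby DCs.
[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=example_cluster
wsrep_cluster_address=gcomm://db1a,db1b,db2a,db2b,db3a,db3b
wsrep_node_name=db1a
# One gmcast segment per DC keeps most replication traffic local.
wsrep_provider_options="gmcast.segment=1"
```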
DigitalOcean gives you a VPS with root access, nothing else.
They have some pre-built images (LAMP, Rails...) but it is your job to get everything up and running.
In this case there's little lock-in, since it maintains InnoDB interface compatibility - you could drop-in replace it with, say, a DigitalOcean-hosted MySQL instance, if you wanted. (Of course you'd be missing out on other perks, but your core MySQL integration wouldn't suffer lock-in.)
So keen for Australia to get a third availability zone so that we can use this. We were lucky enough to be invited to the preview programme, but there was no point to us hosting on another continent. Very exciting!
At $100/TB/mo? I don't think so. S3 is $30/TB/mo. Unless you have some compelling reason to have the blobs in the database, like indexing them, you'd be better off keeping them somewhere else.
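Spelled out with the numbers above (prices as quoted in this thread, not current list prices; the blob volume is hypothetical):

```python
aurora_storage = 100  # $/TB/month, as quoted above
s3_storage = 30       # $/TB/month, as quoted above
blobs_tb = 5          # hypothetical volume of blobs

# Monthly savings from keeping blobs in S3 instead of the database.
savings = blobs_tb * (aurora_storage - s3_storage)
print(savings)  # 350 -> $350/month saved on 5 TB of blobs
```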
I guess if you won't have many blobs, it may be worthwhile to save yourself engineering time of supporting both S3 and Aurora. In an ideal world, there'd just be one permanent data store with a built-in caching layer, for databases, blobs, and log files.
> I guess if you won't have many blobs, it may be worthwhile to save yourself engineering time of supporting both S3 and Aurora. In an ideal world, there'd just be one permanent data store with a built-in caching layer, for databases, blobs, and log files.
We're getting close to that. Write your data to S3 once for persistent storage, then load it into Elasticsearch. Mostly there.
I'd avoid it. It's expensive and MySQL is going to store all those blobs off-page anyway. S3 gives you nice perks like versions and you can just store the s3_name+version as metadata in your DB instead. Also if you want to do any heavy processing of your blobs outside of the DB, S3 tends to be a lot better for parallel access (e.g. hadoop jobs)
Obviously don't know your exact use case, but that's what we do.
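A sketch of the "keep s3_name+version as metadata in your DB" pattern mentioned above. The actual upload would use boto3's `put_object` on a versioning-enabled bucket (the response includes a `VersionId`); here we only model the metadata row, and all names and IDs are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class BlobRef:
    """Metadata row pointing at a versioned blob in S3."""
    bucket: str
    key: str         # the "s3_name"
    version_id: str  # returned by S3 when bucket versioning is enabled

    def s3_uri(self) -> str:
        # Enough to fetch exactly this version of the blob later.
        return f"s3://{self.bucket}/{self.key}?versionId={self.version_id}"

# e.g. after: resp = s3.put_object(Bucket="my-bucket", Key="blobs/42.bin", Body=data)
# you'd persist resp["VersionId"] alongside the key:
ref = BlobRef("my-bucket", "blobs/42.bin", version_id="3HL4kqtJlcp")
print(ref.s3_uri())  # s3://my-bucket/blobs/42.bin?versionId=3HL4kqtJlcp
```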
I'm highly disappointed by Amazon for recommending CloudFormation but always neglecting to maintain it. Aurora has been in preview for a while; adding support for it in advance would have been a great way to coordinate the launches. As a person who fell into the trap of following the "best practice", I'm always months if not years behind in my CloudFormation stack, as it takes months for CloudFormation engineers to add, let's say, a single attribute.
So, based on my experience, give it a quarter before you see it in CloudFormation! I keep bugging Jeff Barr, but it seems that he doesn't have much control over the process.
The whole development process at Amazon seems broken. Everything is developed in a silo and once GA, then other teams start integrating it, unless the two products cannot operate in isolation.
P.S. I keep checking this URL [0] several times a day and I curse several times a day, because it almost never changes!
> Due to Amazon Aurora’s unique storage architecture, replication lag is extremely low, typically between 10 ms and 20 ms.
I think in the original announcement, this number was lower, but this is still great.
In this range of latency, the most compelling thing relevant to my needs is the possibility of using read replicas for serving read-only API calls, with minimal risk of having a write API call immediately followed by a read API call serving stale data. It's possible to orchestrate a prevention for that with regular latency-prone MySQL replication, but it carries tremendous complexity depending on the application.
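One simple way to get that read-your-own-writes behavior: pin a session's reads to the writer for slightly longer than the replica lag after each write. The 50 ms budget below is an illustrative margin over the quoted 10-20 ms lag, and all names are hypothetical:

```python
import time

REPLICA_LAG_BUDGET = 0.050  # seconds; comfortably above 10-20 ms lag

class SessionRouter:
    """Decides whether a session's next read can go to a replica."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_write = float("-inf")

    def note_write(self):
        # Call after every write API call in this session.
        self._last_write = self._clock()

    def read_target(self) -> str:
        # A very recent write must be read from the primary; otherwise
        # any read replica has (very probably) caught up.
        if self._clock() - self._last_write < REPLICA_LAG_BUDGET:
            return "primary"
        return "replica"
```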
Would be very interested if anyone else has explored this idea further with Aurora.
>they verified that each Amazon Aurora instance is able to deliver on our performance target of up to 100,000 writes and 500,000 reads per second
This bit caught my attention: does an "Amazon Aurora instance" mean one computing instance? Or does it refer to something like your allocated share of the overall Aurora platform? Because if they are able to achieve that performance per machine, I'm truly amazed.
Their largest instance appears to be "32 vCPUs and 244GiB Memory"; that sounds credible to be able to sustain that throughput, particularly if your whole dataset fits in RAM, but barely. Would be nice to see R/W performance on the smaller instances.
Do you have any plans for Geo functionality? We're working on a massive project that Aurora sounds perfect for, aside from the apparent lack of that feature.
It seems like MySQL 5.7 will start to have better GIS support. Though that is still a development release at this point. Hopefully if Amazon tries to maintain compatibility, native GIS support may eventually arrive.
Does anyone know how this benchmarks against Postgres and MySQL? Would be curious to see how TPC-C runs on this platform but I can't find any studies of any independent parties benchmarking Aurora.
The AWS pages are very heavy on marketing and light on system characteristics. It would be pretty nice to know if the characteristics of Aurora are different than RDS; for example, RDS table size and index rebuilding were a big problem for a client, and I suspect that Aurora would do better, but have no reason to believe it.
We've made a number of improvements relative to MySQL, for example with large numbers of tables and the results set cache. There are some improvements on large tables and schema changes, but quite a bit more to be done in those areas. You can contact our PM team at aurora[dash]pm[at]amazon[dot]com to give us the issues that matter most. Feedback is how we prioritize!
We've been experiencing huge problems with our multi-AZ and encrypted MySQL RDS instance this morning (it looks like a hardware issue). We're contacting support, but are considering taking our entire application down and migrating to Aurora. The timing on this is too ironic.
That's my question as well - and an important one. Using Galera cluster with MySQL imposes several performance and usage constraints an end user needs to be aware of.
And if it's not Galera, how did Amazon work around the constraints of multiple writers in an ACID database, and what constraints does it impose?
It definitely looks like Galera (which is what powers both the MySQL and MariaDB cluster implementations; it's built atop InnoDB), but it's hard to say without more information. They mention quorum writes with automatic recovery across three nodes, but don't mention the method used - two-phase commit, checking commits against pending transactions, etc.
It's a very complex thing to implement, and unless they have made leaps beyond what Galera has done, for some workloads it will be fast, but for others it will perform far worse than a standard MySQL instance.
Of course, I guess it could also be a cluster built upon NDB, but the lack of memory constraints on the size of the data makes that less likely.
Since it sounds like you have information on what it is based upon (if only the principles used to address distributed ACID consistency), it would be good to get that information disseminated; it's hard to trust that it will "just work" when we have so many examples of distributed ACID not working well.
You can think of Aurora as a single-instance database where the lower quarter is pushed down into a multi-tenant scale-out storage system. Transactions, locking, LSN generation, etc all happen at the database node. We push log records down to the storage tier and Aurora storage takes responsibility for generation of data blocks from logs.
So, the ACI components of ACID are all done at the database tier using (largely) traditional techniques. Durability is where we're using distributed systems techniques around quorums, membership management, leases, etc, with the important caveat that we have a head node generating LSNs, providing a monotonic logical clock, and avoiding those headaches.
Our physical read replicas receive redo log records, update cached entries and have readonly access to the underlying storage tier. The underlying storage is log-structured with nondestructive writes, so we can access data blocks in the past of what is current at the write master node - that's required if the replica needs to read a slightly older version of a data block for consistency reasons.
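A toy illustration of the log-structured, nondestructive storage behavior described above (this is only a sketch of the idea, not Aurora's actual implementation): each redo record adds a new version of a block keyed by LSN, so a lagging replica can read a block "as of" an older LSN:

```python
import bisect
from collections import defaultdict

class VersionedStore:
    """Blocks materialized from a redo log; old versions stay readable."""

    def __init__(self):
        # block_id -> parallel lists of LSNs and block images, in LSN order
        self._lsns = defaultdict(list)
        self._data = defaultdict(list)

    def apply_log(self, lsn, block_id, data):
        # Nondestructive write: append a new version, never overwrite.
        self._lsns[block_id].append(lsn)
        self._data[block_id].append(data)

    def read(self, block_id, as_of_lsn):
        # Latest version at or before the requested LSN.
        i = bisect.bisect_right(self._lsns[block_id], as_of_lsn) - 1
        if i < 0:
            raise KeyError("no version of this block at or before that LSN")
        return self._data[block_id][i]

store = VersionedStore()
store.apply_log(10, block_id=1, data=b"v1")
store.apply_log(20, block_id=1, data=b"v2")
print(store.read(1, as_of_lsn=15))  # b'v1': what a replica at LSN 15 sees
print(store.read(1, as_of_lsn=25))  # b'v2': current at the write master
```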
Is 30GB of RAM the max for this service? That's rather paltry for very large joins, which seem plausible given they claim support to up to 64TB of data. Percona had recommended 144GB+ and moving to that from 64GB made our BI-type queries an order of magnitude faster. Obvi YMMV but in my experience one needs more RAM for joins before more disk for data.
That said, the UI and system for managing replication looks pretty nice. I've never done MySQL admin directly but I do appreciate how much of a pain point this can be.
"You can use Amazon RDS to scale your Amazon Aurora database instance up to 32 vCPUs and 244GiB Memory. You can also add up to 15 Amazon Aurora Replicas across three availability zones to further scale read capacity. Amazon Aurora automatically grows storage as needed, from 10GB up to 64TB."
Because that's when it was ready. Our launch process is fairly involved -- getting all of the code out to production servers across multiple regions, updating the console, pushing the docs, and so forth. The last three steps for a big release are push the press release, push the blog post, and push the social media.
Every time I see a comment like this, I get the feeling that the poster doesn't understand how "the cloud" works.
Everybody knows that renting a dedicated server (or purchasing your own servers) is generally a more cost effective approach. Cloud computing is used to accomplish different goals when used appropriately.
Sure you could install MySQL on a box at Hetzner, but what happens when it goes down? Do you feel like maintaining the box for security and updates? What about log rotation? What if you need to migrate to a different geographical region, or need to increase capacity?
Smaller sites might be okay with managing those on their own, but as you grow, being able to offload that work to Amazon (who tend to know what they are doing more than the average joe) is a benefit when you have other business-logic issues to deal with.
So yes, naively getting a box at Hetzner or any other dedicated provider would be less of an upfront monetary cost, but in the long run, the advantages aren't as clear cut.
For some people, this kind of end-run around the GPL is the poster child for the AGPL: https://en.wikipedia.org/wiki/Affero_General_Public_License