Amazon Aurora Now Available (amazon.com)
342 points by jeffbarr on July 28, 2015 | 105 comments



Interesting from a licensing point of view: the GPL does not require you to redistribute changes that are 'internal' to your organization. You are not redistributing the program itself in this case, you're just letting someone access it.

For some people, this kind of end-run around the GPL is the poster child for the AGPL: https://en.wikipedia.org/wiki/Affero_General_Public_License


So the way Werner Vogels explained it at the AWS Summit in NYC recently, Aurora is built from scratch and is MySQL compatible, but it is not a fork. He mentioned that they broke it down into components which are each built on pre-existing AWS services - for example, the storage layer is just DynamoDB, which I thought was really interesting.


I would love to hear more about this, if you know where I can find more info.


Here's the point in the keynote where he starts talking about it - really interesting stuff! https://youtu.be/0rPpCnFE-hU?t=33m34s


There's a distinct possibility that it's not based on MySQL at all. From their prior discussions, this looks like a clean implementation that just uses its API, so even if MySQL were AGPLed, Amazon wouldn't need to release anything.


But thanks to recent rulings [1], APIs are still copyrightable and fall under the same license.

[1] http://arstechnica.com/tech-policy/2014/10/google-oracle-jav...


I believe parent is using "API" in the "wire protocol" sense, not the "intra-process programming interface" sense.

Those have not been deemed copyrightable as yet, because they don't necessarily require header files.


It's just a storage engine, albeit a very good one (I was on the beta); storage engines act as a plugin to the mysql/mariadb server layer, and are typically just a separate binary.


Can't you get around the AGPL by just making it an internal service on your network? Basically it looks like this.

Internet --> Your API service --> AGPL program

Your API service uses the AGPL program to implement its logic. Since the FSF holds that an API cannot be copyrighted, Your API service implements the same API as the AGPL program and uses a network connection to the AGPL program running on another computer to implement the API. Since the Internet user never connects to the AGPL program, they are not entitled to the source code. You are required to release the source code to the users interacting with the AGPL program over the network. You therefore release the source code to the implementers of Your API service and call it a day.

Adding another layer of indirection solves a lot of problems in computer science, even the GPL and AGPL.
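Mechanically, the indirection is trivial to set up (whether it holds up legally is the open question). A minimal sketch in Python, where the internal hostname, port, and plain HTTP pass-through are all illustrative assumptions rather than anything prescribed:

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    # Hypothetical internal host running the AGPL program; never exposed to the Internet.
    INTERNAL_BACKEND = "http://10.0.0.5:8080"

    class ApiServiceHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Internet users only ever talk to this separately licensed proxy;
            # it relays their requests to the AGPL program over the internal network.
            with urlopen(INTERNAL_BACKEND + self.path) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "application/octet-stream")
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), ApiServiceHandler).serve_forever()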


That is correct.

However, that only holds if there is effectively an API service in the middle, making your AGPL program internal. But if you are exposing an AGPL program (let's say a database) as a service, you cannot have an API service in between (unless you re-implement the entire API yourself), and hence you are exposing the AGPL database directly, in which case you are bound by the license with respect to the users.


Thanks for posting this, we've had some arguments with our investors and their lawyers recently about GPL and AGPL meanings (specifically for Golang, in our case) and how they relate to SaaS. One case in point: we have a cronspec parser package licensed AGPL (what the hell!?) which doesn't even include a web server or listener code of any kind… so are we in violation by using it to parse cronspecs that our customers have entered into our SaaS API?!

Do you have any citation, references, or case-law I could use as a starting point to steal your arguments as my own?


The citations/case law to use are those that define the terms distribution, adaptation, and derivative work.

With the GPL, anything which isn't client-side code should be fine, especially since legal advice from the GPL's authors has said that SaaS and the GPL do not put any additional requirements on the service provider.

The AGPL talks about using the work, and here the law as I have read it says that you have to be a lawful owner in order to be permitted to copy the work into RAM. In order to be a lawful owner of the copy, you have to be in compliance with the license. How you do so is up to you, but the perceived consensus seems to put that responsibility on the SaaS provider.

Since we are talking about a SaaS API, what counts as the program is sadly a grey zone. For example, Linux-based operating systems commonly have a command-line API, but they aren't a single program. Ask yourself (or the lawyers) what a nontechnical layman would consider a single work and what they would consider multiple separate programs working in unison. I suspect it highly depends on what the API does, how data flows, and how the internal source code is laid out.


> With the GPL, anything which isn't client-side code should be fine, especially since legal advice from the GPL's authors has said that SaaS and the GPL do not put any additional requirements on the service provider.

Legal guidance from GPL authors is of somewhat limited utility; they aren't your lawyers, and for software where the FSF isn't the copyright holder as well as the license author, it doesn't even have the utility of being a documented representation from the copyright holder as to their intent with the license they were offering.

They are also employed by an entity with a vested interest in promoting the use of the GPL.


> so are we in violation by using it to parse cronspecs that our customers have entered into our SaaS API?!

If you don't provide the source somewhere or link to it, yes, but it probably won't require you to make your entire codebase AGPL.


Has this been tested in court? Clever engineering solutions seem not to always win the day in front of a Judge.


There is no mention of the Internet, just a network server. Your scenario is likely covered by the license.


The AGPL is fundamentally flawed. Imagine if Google were subject to it; the idea of free software is to have the option to modify that which you run yourself. Without a million spindles, Google's main products are useless to you.

Services are different than software.


> Imagine if Google were subject to it; the idea of free software is to have the option to modify that which you run yourself. Without a million spindles, Google's main products are useless to you.

Without years of expertise, and tons of free time, you will not be able to modify the free software either. You can have someone do it on your behalf though, and just as well you could have had someone (Yahoo? Microsoft?) run Google software for you, had it been available as free software.


> Without years of expertise, and tons of free time, you will not be able to modify the free software either. You can have someone do it on your behalf though, and just as well you could have had someone (Yahoo? Microsoft?) run Google software for you, had it been available as free software.

I'd like to see something like this with regards to B corporations or non-profits. Something like Gmail, Gravatar, or the like that are services you can pay for, but they're not run for-profit. Companies can bid on the management of the service, maintenance of the underlying open source packages, and so forth.

Something between municipal fiber and Wikipedia, but for SaaS services.


Mentioned this on Twitter to the Aurora program manager but it would be awesome to see a PostgreSQL compatible frontend.

Still though, awesome achievement to build a competitive database engine. :)


> but it would be awesome to see a PostgreSQL compatible frontend

Why not just use PostgreSQL directly?


Aurora has a bunch of nice features WRT automatic failover, redundancy, and automatically scaling storage as you push more data into the DB.


Does Aurora do anything other than detect a heartbeat failure, shoot the node in the head and bring it up on another box WRT failover?


That is the basic idea for such things, though distributed protocols become complex when you have to consider split-brain, multiple concurrent failures, etc., all of which can occur in large-scale events.

Aurora leverages Amazon RDS Multi-AZ in this area. Their implementation has been battle-tested over many years and many many db instances.


I'm really wondering if something very close will be possible soon with AWS EFS.


A Postgres to MySQL client mapping middleware is probably the holy grail? I think it's been done in the JDBC space.


You can get this essentially via mysql_fdw: https://github.com/EnterpriseDB/mysql_fdw


Oh yes, please.

That would greatly simplify tooling and integration with RedShift (which already uses PostgreSQL frontend/protocol).


If you want to synchronize your postgres db (and a lot of other things) to redshift, we do it as a service at fivetran.com


There's no such concept of storage engines/plugins in PostgreSQL, so it looks unlikely this could happen any time soon.



Isn't Aurora simply Amazon's private fork of MySQL, hosted for you?


They already have that in RDS. This is a database of their own design. They were pragmatic enough to use the query language that most devs already know and trust, but whatever pulls off reads, writes, and sharding at that scale must have fundamentally different internals than a single MySQL instance.


It certainly seems to be a MySQL 5.6 fork (or at least a backend a la InnoDB or similar), as the docs explicitly call out MySQL 5.6 compatibility. I feel like I wouldn't say "simply" though; that makes it seem like some guys at Amazon checked out the MySQL source, deleted some stuff, and then released this.


Unlikely, because they already have a hosted MySQL 5.6 product in RDS. If anything, this is a MySQL 5.6 frontend + query processing engine with a new storage layer.


It would have been extremely difficult because MySQL community edition is GPL and Amazon does not allow GPL code to be modified by the internal teams as far as I know. They might have changed their policy though...


I've lost track of the characteristics of the players in this space. Do Rackspace or DigitalOcean have anything that even compares?

I've always used Rackspace as a sort of default. They seem to employ good people, so they must be good - so the thinking goes.

But what's the real breakdown between these three? And any others I might not know about?


AWS is a giant construction store like Home Depot, where you can find everything. Rackspace is a local home improvement shop with good service, which was doing fine until Home Depot built its store nearby. Now people have fewer and fewer reasons to visit it. Some folks like it for old times' sake, though. DigitalOcean is your buddy who works at "Hammer&Nails Manufacturing Inc."; he can get you nice hammers and nails pretty cheap if you ask him, but not much else.

That's how I see it :)


What do you do with your analogy now that Rackspace offers official support for helping you with your AWS and Azure deployments? Are the general contractors who started the local hardware store now also working a side job at Home Depot? ;)


Local shop owner offers a consulting service - he'll go with you to the Home Depot and help you find the right things for your project (hoping that next time you'll go to his shop, after you see how good he is) :)


RAX only offers support for Azure, not AWS...


Except that Rackspace revenue is still growing steadily year over year at a solid pace, so clearly not everyone agrees with the "less and less reasons to visit it".


I think they are growing much slower than the industry and competition from AWS makes their margin razor thin. Latest quarter report shows that revenue growth was only 1.6% over previous quarter and profit margin of 0.6%, and their shares are down more than 60% since 2013 maximum (on the growing market). So, overall it doesn't look very good.


DO is for people who want root access to a commodity VM at the lowest price possible.

Rackspace is for people who want root access to a commodity VM and are willing to pay more for support.

AWS is for people who are willing to ditch commodity VMs and replace them with orchestrated services. (They'll sell you a commodity VM via EC2, but if that's all you want there's not a lot of reason to go with them.)

Rackspace's problem is that this segmentation puts them in the awkward position of being the wrong choice for the extremely price-sensitive (who go with DO 'cause it's cheap, or with even cheaper alternatives like Dreamhost) and the wrong choice for the "money is no object" crowd (who go with AWS, because at scale working with orchestrated services is much easier than managing huge fleets of persistent VMs). Which ends up not leaving a whole lot of people for them to sell to.


Google has CloudSQL (=MySQL) and BigQuery, which is SQL-ish.

https://cloud.google.com/sql/

https://cloud.google.com/bigquery/


Let's not talk about how much of a pain CloudSQL is to use.


Really? I'd love to hear about it, please drop me a line.

(Disclaimer: I work on Google Cloud, although not on SQL specifically.)


I'd like to hear it also. We are using it and never had much to complain about.


It should be Google Cloud Datastore, not BigQuery.


The closest thing to Aurora would be Clustrix; scale at the storage level behind a single entry point.

Aurora is more than just RDBMS as a service.


You could just install a Galera cluster on DO and it'd function similarly, and tbh, that is possibly what they did.

Ignoring the benchmark BS [which is almost always BS] basically they are offering:

> With storage replicated both within and across three Availability Zones, along with an update model driven by quorum writes, Amazon Aurora is designed to deliver high performance and 99.99% availability while easily and efficiently scaling to up to 64 TB of storage.

Galera is [basically] quorum writes, and can scale to ~6 nodes pretty effectively [beyond 6 you run into issues with write performance, imho; this would be a 2/2/2 setup across the 3 availability zones].

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concep...

Galera's synchronous write process means you want everything near each other geographically, so 3 AZs in the same region would make perfect sense.

> Along the way, they verified that each Amazon Aurora instance is able to deliver on our performance target of up to 100,000 writes and 500,000 reads per second, along with a price to performance ratio that is 5 times better than previously available.

That is literally the read/write ratio you'll see on a 6 node Galera cluster. Reads scale per-node [reads are local], but writes don't [every node has to perform the write + the replication overhead].
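A back-of-the-envelope sketch of that asymmetry (all numbers are made-up placeholders, not benchmarks): read capacity adds up per node, write capacity doesn't.

    # Toy illustration: reads are served locally on each node,
    # but every node must apply every write, so writes don't scale out.
    per_node_reads = 85_000      # assumed reads/sec a single node can serve
    per_node_writes = 100_000    # assumed writes/sec a single node can apply
    replication_overhead = 0.15  # assumed fraction lost to certification/replication

    for nodes in (1, 3, 6):
        cluster_reads = per_node_reads * nodes
        cluster_writes = per_node_writes * (1 - replication_overhead)
        print(f"{nodes} nodes: ~{cluster_reads:,} reads/s, ~{cluster_writes:,.0f} writes/s")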

So yeah, you can set something similar up with any provider with 3 geographically near DCs.

https://www.digitalocean.com/features/reliability/

DO has that setup in NYC and AMS, at least I assume NYC1/2/3 AMS1/2/3 are different AZ equivalents.



DigitalOcean gives you a VPS with root access, nothing else. They have some pre-built images (LAMP, Rails...) but it is your job to get everything up and running.


This is generally preferable since it protects you from vendor lock-in.


In this case there's little lock-in, since it maintains InnoDB interface compatibility - you could drop-in replace it with, say, a DigitalOcean-hosted MySQL instance, if you wanted. (Of course you'd be missing out on other perks, but your core MySQL integration wouldn't suffer lock-in.)


Yes, not using any vendor's products protects you from vendor lock-in. To call that "generally preferable" is awfully simplistic though.


So keen on Australia to get a third availability zone so that we can use this - we were lucky enough to be invited to the preview programme but there was no point to us hosting on another continent. Very exciting!


At that pricing, I'm curious if it makes sense to put blobs in the database rather than S3. Any thoughts?


At $100/TB/mo? I don't think so. S3 is $30/TB/mo. Unless you have some compelling reason to have the blobs in the database, like indexing them, you'd be better off keeping them somewhere else.

I guess if you won't have many blobs, it may be worthwhile to save yourself engineering time of supporting both S3 and Aurora. In an ideal world, there'd just be one permanent data store with a built-in caching layer, for databases, blobs, and log files.


> I guess if you won't have many blobs, it may be worthwhile to save yourself engineering time of supporting both S3 and Aurora. In an ideal world, there'd just be one permanent data store with a built-in caching layer, for databases, blobs, and log files.

We're getting close to that. Write your data once to S3 for persistent storage, then load it into Elasticsearch. Mostly there.


I'd avoid it. It's expensive and MySQL is going to store all those blobs off-page anyway. S3 gives you nice perks like versions and you can just store the s3_name+version as metadata in your DB instead. Also if you want to do any heavy processing of your blobs outside of the DB, S3 tends to be a lot better for parallel access (e.g. hadoop jobs)
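A rough sketch of that pattern (bucket and table names are placeholders, sqlite stands in for the real database, and it assumes versioning is enabled on the bucket):

    import sqlite3
    import boto3

    s3 = boto3.client("s3")
    db = sqlite3.connect("metadata.db")
    db.execute("CREATE TABLE IF NOT EXISTS blobs "
               "(id INTEGER PRIMARY KEY, s3_key TEXT, s3_version TEXT)")

    def store_blob(key, data):
        # Upload the blob; with bucket versioning on, S3 hands back a VersionId.
        resp = s3.put_object(Bucket="my-blob-bucket", Key=key, Body=data)
        db.execute("INSERT INTO blobs (s3_key, s3_version) VALUES (?, ?)",
                   (key, resp["VersionId"]))
        db.commit()

    def load_blob(blob_id):
        # Only the (key, version) metadata lives in the database.
        key, version = db.execute(
            "SELECT s3_key, s3_version FROM blobs WHERE id = ?", (blob_id,)).fetchone()
        return s3.get_object(Bucket="my-blob-bucket", Key=key, VersionId=version)["Body"].read()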

Obviously don't know your exact use case, but that's what we do.


That's nice pricing, particularly paying for what you actually use, but it's still more than 3x the cost of just storing it in S3.


No Cloudformation support yet? I don't see it in the docs.


I'm highly disappointed by Amazon for recommending CloudFormation but always neglecting to maintain it. Aurora has been in preview for a while; adding support for it in advance would have been a great way to coordinate the launches. As a person who fell into the trap of following the "best practice", I'm always months if not years behind in my CloudFormation stack, as it takes months for the CloudFormation engineers to add, let's say, a single attribute.

So, based on my experience, give it a quarter before you see it in CloudFormation! I keep bugging Jeff Barr, but it seems that he doesn't have much control over the process.

The whole development process at Amazon seems broken. Everything is developed in a silo, and only once it's GA do other teams start integrating it, unless the two products cannot operate in isolation.

P.S. I keep checking this URL [0] several times a day and I curse several times a day, because it almost never changes!

[0] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...


Amazon is involved in a lawsuit over CloudFormation and U.S. Patent No. 8,271,974. Perhaps they are hesitant to invest until that is cleared up.


They still update it, just very slowly, but I wasn't aware of this other patent troll. I guess Terraform and Fugue are in danger as well.


> Due to Amazon Aurora’s unique storage architecture, replication lag is extremely low, typically between 10 ms and 20 ms.

I think in the original announcement, this number was lower, but this is still great.

In this range of latency, the most compelling thing relevant to my needs is the possibility of using read replicas for serving read-only API calls, with minimal risk of having a write API call immediately followed by a read API call serving stale data. It's possible to orchestrate a prevention for that with regular latency-prone MySQL replication, but it carries tremendous complexity depending on the application.
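For illustration, one simple orchestration is to pin a session's reads to the writer for a window slightly larger than the observed lag after it writes (the endpoints and the 50 ms window below are assumptions, not Aurora specifics):

    import time

    LAG_BUDGET = 0.05  # seconds; comfortably above a 10-20 ms replication lag

    class SessionRouter:
        def __init__(self, writer, reader):
            self.writer = writer
            self.reader = reader
            self.last_write = {}  # session_id -> monotonic timestamp of last write

        def note_write(self, session_id):
            self.last_write[session_id] = time.monotonic()

        def read_endpoint(self, session_id):
            wrote_at = self.last_write.get(session_id)
            if wrote_at is not None and time.monotonic() - wrote_at < LAG_BUDGET:
                return self.writer  # read-your-writes: stay on the writer briefly
            return self.reader      # otherwise offload to the read replica

    router = SessionRouter("aurora-writer.example.com", "aurora-reader.example.com")
    router.note_write("session-42")
    print(router.read_endpoint("session-42"))  # writer, right after a write
    time.sleep(0.06)
    print(router.read_endpoint("session-42"))  # reader, once the window has passed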

Would be very interested if anyone else has explored this idea further with Aurora.


>they verified that each Amazon Aurora instance is able to deliver on our performance target of up to 100,000 writes and 500,000 reads per second

This bit caught my attention: does an "Amazon Aurora instance" mean one computing instance? Or do they refer to something like your allocated share of the overall Aurora platform? Because if they are able to achieve that performance per machine, I'm truly amazed.

Their largest instance appears to be "32 vCPUs and 244GiB Memory", which sounds credible to be able to sustain that throughput, particularly if your whole data set fits in RAM, but barely. It would be nice to see R/W performance on the smaller instances.


You can read the blog post that I wrote last year when we announced Aurora to learn more about how it works:

https://aws.amazon.com/blogs/aws/highly-scalable-mysql-compa...


Do you have any plans for Geo functionality? We're working on a massive project that Aurora sounds perfect for, aside from the apparent lack of that feature.


It seems like MySQL 5.7 will start to have better GIS support. Though that is still a development release at this point. Hopefully if Amazon tries to maintain compatibility, native GIS support may eventually arrive.


it is a per-instance number. For those SysBench numbers, we are running an r3.8xlarge instance type.


Does anyone know how this benchmarks against Postgres and MySQL? Would be curious to see how TPC-C runs on this platform but I can't find any studies of any independent parties benchmarking Aurora.


It only just came out of preview, so I'd expect someone independent to test it out in the coming weeks. Maybe / hopefully one of the Percona folks?

Edit: typo (out of preview)


For TPC-C like benchmarks, you can run: 1) CloudHarmony: https://github.com/cloudharmony/oltpbench 2) Percona: https://code.launchpad.net/~perconadev/perconatools/tpcc-mys...

We've found it easier to load large datasets using CloudHarmony but have run both.

I'd recommend reading through http://d0.awsstatic.com/product-marketing/Aurora/RDS_Aurora_.... It is oriented towards SysBench, but will help you set up your clients to have enough network throughput to run a full test.

Generally, we find the performance comparison improves on large instances, high throughput workloads, or when the data set does not fit in RAM.


The AWS pages are very heavy on marketing and light on system characteristics. It would be pretty nice to know if the characteristics of Aurora are different than RDS; for example, RDS table size and index rebuilding were a big problem for a client, and I suspect that Aurora would do better, but have no reason to believe it.


Aurora is not an RDS alternative. It is a new database engine available to use on the RDS platform.


hopefully everyone else understood what I said.


We've made a number of improvements relative to MySQL, for example with large numbers of tables and the result set cache. There are some improvements on large tables and schema changes, but quite a bit more to be done in those areas. You can contact our PM team at aurora[dash]pm[at]amazon[dot]com to give us the issues that matter most. Feedback is how we prioritize!


We've been experiencing huge problems with our multi-AZ and encrypted MySQL RDS instance this morning (it looks like a hardware issue). We're contacting support, but are considering taking our entire application down and migrating to Aurora. The timing on this is too ironic.


I wonder what this is. Did they create a proper multitenant version of MySQL, or are they simply running mysql without a container around it?

I'm guessing it's the latter (much easier) option, but... do you really get 5x savings from simply not having a container?


Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS

* https://aws.amazon.com/blogs/aws/highly-scalable-mysql-compa...

It's an engine for MySQL.

Aurora Database Architecture by Amazon Web Services:

* https://www.youtube.com/watch?v=-TbRxwcux3c&list=WL&index=5


It isn't MySQL at all. It just has a facade on the front compatible with MySQL's API.


Is this based on Galera?


That's my question as well - and an important one. Using Galera cluster with MySQL imposes several performance and usage constraints an end user needs to be aware of.

And if it's not Galera, how did Amazon work around the constraints of multiple writers in an ACID database, and what constraints does it impose?


It's linked in the article; here are more links:

https://news.ycombinator.com/item?id=9960276


It definitely looks like Galera (which is what powers both MySQL and MariaDB cluster implementations; it's a storage engine built atop InnoDB), but it's hard to say without more information. They mention quorum writes with automatic recovery across three nodes, but don't mention the method used - two-phase commit, checking commits against pending transactions, etc.

It's a very complex thing to implement, and unless they have made leaps beyond what Galera has done, for some workloads it will be fast, but for others it will perform far worse than a standard MySQL instance.

Of course, I guess it could also be a cluster built upon NDB, but the lack of memory constraints on the size of the data makes that less likely.


Aurora isn't implemented based on either Galera or NDB.


Since it sounds like you have information on what it is based upon (if only the principles which were used to address distributed ACID consistency), it would be good to get this information disseminated - it's hard to trust that it will "just work" when we have so many examples of distributed ACID not working well.


You can think of Aurora as a single-instance database where the lower quarter is pushed down into a multi-tenant scale-out storage system. Transactions, locking, LSN generation, etc all happen at the database node. We push log records down to the storage tier and Aurora storage takes responsibility for generation of data blocks from logs.

So, the ACI components of ACID are all done at the database tier using (largely) traditional techniques. Durability is where we're using distributed systems techniques around quorums, membership management, leases, etc, with the important caveat that we have a head node generating LSNs, providing a monotonic logical clock, and avoiding those headaches.

Our physical read replicas receive redo log records, update cached entries and have readonly access to the underlying storage tier. The underlying storage is log-structured with nondestructive writes, so we can access data blocks in the past of what is current at the write master node - that's required if the replica needs to read a slightly older version of a data block for consistency reasons.

Make sense?
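A toy sketch of the non-destructive, log-structured idea (nothing like Aurora's real implementation, which ships redo records rather than whole page contents): writes append page versions tagged with an LSN, and a reader can materialize a page as of any LSN, which is what lets a replica read slightly into the past.

    class LogStructuredPages:
        def __init__(self):
            self.log = []  # append-only list of (lsn, page_id, contents)

        def append(self, lsn, page_id, contents):
            self.log.append((lsn, page_id, contents))  # non-destructive write

        def read_page(self, page_id, as_of_lsn):
            # Latest version of the page at or before the requested LSN.
            version = None
            for lsn, pid, contents in self.log:
                if pid == page_id and lsn <= as_of_lsn:
                    version = contents
            return version

    store = LogStructuredPages()
    store.append(1, "page-A", "v1")
    store.append(2, "page-A", "v2")
    print(store.read_page("page-A", as_of_lsn=1))  # "v1" - what a lagging replica sees
    print(store.read_page("page-A", as_of_lsn=2))  # "v2" - current at the write master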


Absolutely not. Galera operates on synchronous transactions, Aurora operates at the storage layer.


Is 30GB of RAM the max for this service? That's rather paltry for very large joins, which seem plausible given they claim support for up to 64TB of data. Percona had recommended 144GB+, and moving to that from 64GB made our BI-type queries an order of magnitude faster. Obvi YMMV, but in my experience one needs more RAM for joins before more disk for data.

That said, the UI and system for managing replication looks pretty nice. I've never done MySQL admin directly but I do appreciate how much of a pain point this can be.


No; from the product page:

"You can use Amazon RDS to scale your Amazon Aurora database instance up to 32 vCPUs and 244GiB Memory. You can also add up to 15 Amazon Aurora Replicas across three availability zones to further scale read capacity. Amazon Aurora automatically grows storage as needed, from 10GB up to 64TB."


From the article: "Instances are available in 5 sizes, with 2 to 32 vCPUs and 15.25 to 244 GiB of memory."


Can anybody with any experience compare this vs RDS for me?


Aurora is an engine for RDS.


Offtopic a bit but I'm genuinely curious: Jeff, why did you post this at 10pm Pacific?


Because that's when it was ready. Our launch process is fairly involved -- getting all of the code out to production servers across multiple regions, updating the console, pushing the docs, and so forth. The last three steps for a big release are push the press release, push the blog post, and push the social media.


That's pretty much his job and he's very passionate about it.


You got it!


and that's all there is right now..


Because he feels like it.


At these prices I would rather install MySQL on Hetzner myself at a fraction of the cost.


Every time I see a comment like this, I get the feeling that the poster doesn't understand how "the cloud" works.

Everybody knows that renting a dedicated server (or purchasing your own servers) is generally a more cost effective approach. Cloud computing is used to accomplish different goals when used appropriately.

Sure you could install MySQL on a box at Hetzner, but what happens when it goes down? Do you feel like maintaining the box for security and updates? What about log rotation? What if you need to migrate to a different geographical region, or need to increase capacity?

Smaller sites might be okay with managing those on their own, but as you grow, being able to offload that work to Amazon (who tend to know what they are doing more than the average joe) is a benefit when you have other business-logic issues to deal with.

So yes, naively getting a box at Hetzner or any other dedicated provider means a lower upfront monetary cost, but in the long run, the advantages aren't as clear cut.


I think Aurora doesn't compare to any other host-it-yourself solution, being managed and 'replicated' across multiple datacenters.

Plus, Hetzner, really?



