New Data Transfer Tools for Hadoop: Sqoop 2
1. A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
Bilung Lee (blee at cloudera dot com)
Kathleen Ting (kathleen at cloudera dot com)
Hadoop Summit 2012. 6/13/12
Apache Sqoop, Copyright 2012 The Apache Software Foundation
2. Who Are We?
• Bilung Lee
  – Apache Sqoop Committer
  – Software Engineer, Cloudera
• Kathleen Ting
  – Apache Sqoop Committer
  – Support Manager, Cloudera
3. What is Sqoop?
• Bulk data transfer tool
  – Import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
  – Populate tables in HDFS, Hive, and HBase
  – Integrate with Oozie as an action
  – Support plugins via connector-based architecture
Timeline:
May '09: First version (HADOOP-5815)
March '10: Moved to GitHub
August '11: Moved to Apache
April '12: Apache Top Level Project
4. Sqoop 1 Architecture
[Diagram: a Sqoop command launches Hadoop map tasks that transfer data between external systems (enterprise data warehouse, document-based systems, relational database) and HDFS/HBase/Hive.]
5. Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Tight coupling between data transfer and output format
• Security concerns with openly shared credentials
• Not easy to manage installation/configuration
• Connectors are forced to follow JDBC model
7. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
8. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
9. Ease of Use
Sqoop 1 | Sqoop 2
Client-only architecture | Client/server architecture
CLI based | CLI + Web based
Client access to Hive, HBase | Server access to Hive, HBase
Oozie and Sqoop tightly coupled | Oozie finds REST API
10. Sqoop 1: Client-side Tool
• Client-side installation + configuration
  – Connectors are installed/configured locally
  – Local installation requires root privileges
  – JDBC drivers are needed locally
  – Database connectivity is needed locally
11. Sqoop 2: Sqoop as a Service
• Server-side installation + configuration
  – Connectors are installed/configured in one place
  – Managed by administrator and run by operator
  – JDBC drivers are needed in one place
  – Database connectivity is needed on the server
12. Client Interface
• Sqoop 1 client interface:
  – Command line interface (CLI) based
  – Can be automated via scripting
• Sqoop 2 client interface:
  – CLI based (in either interactive or script mode)
  – Web based (remotely accessible)
  – REST API is exposed for external tool integration
13. Sqoop 1: Service Level Integration
• Hive, HBase
  – Require local installation
• Oozie
  – von Neumann(esque) integration:
    • Package Sqoop as an action
    • Then run Sqoop from node machines, causing one MR job to be dependent on another MR job
    • Error-prone, difficult to debug
15. Ease of Use
Sqoop 1 | Sqoop 2
Client-only architecture | Client/server architecture
CLI based | CLI + Web based
Client access to Hive, HBase | Server access to Hive, HBase
Oozie and Sqoop tightly coupled | Oozie finds REST API
16. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
17. Ease of Extension
Sqoop 1 | Sqoop 2
Connector forced to follow JDBC model | Connector given free rein
Connectors must implement functionality | Connectors benefit from common framework of functionality
Connector selection is implicit | Connector selection is explicit
18. Sqoop 1: Implementing Connectors
• Connectors are forced to follow JDBC model
  – Connectors are limited/required to use common JDBC vocabulary (URL, database, table, etc.)
• Connectors must implement all Sqoop functionality they want to support
  – New functionality may not be available for previously implemented connectors
19. Sqoop 2: Implementing Connectors
• Connectors are not restricted to JDBC model
  – Connectors can define their own domain
• Common functionality is abstracted out of connectors
  – Connectors are only responsible for data transfer
  – Common Reduce phase implements data transformation and system integration
  – Connectors can benefit from future development of common functionality
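The division of labor on slide 19 can be sketched in a few lines. This is a minimal illustration in Python, not Sqoop 2's actual (Java) connector API: `Connector`, `extract`, `ListConnector`, `FORMATTERS`, and `run_import` are all hypothetical names. The connector only yields records; formatting lives in the shared framework, so a format added to the framework becomes available to every connector at once.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List, Sequence

Record = Sequence[object]


class Connector(ABC):
    """Hypothetical connector contract: data transfer only."""

    @abstractmethod
    def extract(self) -> Iterator[Record]:
        """Yield records from the external system."""


class ListConnector(Connector):
    """Toy connector backed by an in-memory list instead of a database."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self._rows = list(rows)

    def extract(self) -> Iterator[Record]:
        yield from self._rows


def to_csv(record: Record) -> str:
    return ",".join(str(field) for field in record)


def to_tsv(record: Record) -> str:
    return "\t".join(str(field) for field in record)


# Shared by all connectors; adding a format here needs no connector changes.
FORMATTERS: Dict[str, Callable[[Record], str]] = {"csv": to_csv, "tsv": to_tsv}


def run_import(connector: Connector, output_format: str) -> List[str]:
    """Framework side: pull records from the connector, then format them."""
    formatter = FORMATTERS[output_format]
    return [formatter(record) for record in connector.extract()]
```

Calling `run_import(ListConnector([(1, "alice")]), "csv")` formats the toy connector's records without the connector knowing anything about CSV.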
20. Different Options, Different Results
Which is running MySQL?
$ sqoop import --connect jdbc:mysql://localhost/db \
    --username foo --table TEST
$ sqoop import --connect jdbc:mysql://localhost/db \
    --driver com.mysql.jdbc.Driver --username foo --table TEST
• Different options may lead to unpredictable results
  – Sqoop 2 requires explicit selection of a connector, thus disambiguating the process
21. Sqoop 1: Using Connectors
• Choice of connector is implicit
  – In a simple case, based on the URL in the --connect string used to access the database
  – Specification of different options can lead to different connector selection
  – Error-prone but good for power users
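Implicit selection can be modeled roughly as below. This is a simplified sketch, not Sqoop 1's real code: `choose_connector`, `SPECIALIZED`, and the connector names are hypothetical, and the rule that an explicit `--driver` bypasses URL matching and falls back to a generic JDBC connector is an assumption of the sketch, based on how Sqoop 1 is commonly described as behaving.

```python
from typing import Dict

# Hypothetical mapping from JDBC URL prefix to a specialized connector name.
SPECIALIZED: Dict[str, str] = {
    "jdbc:mysql:": "mysql",
    "jdbc:postgresql:": "postgresql",
}
GENERIC = "generic-jdbc"


def choose_connector(options: Dict[str, str]) -> str:
    """Pick a connector implicitly from command line options."""
    # Assumption of this sketch: naming a driver skips URL-based matching.
    if "driver" in options:
        return GENERIC
    url = options["connect"]
    for prefix, name in SPECIALIZED.items():
        if url.startswith(prefix):
            return name
    return GENERIC
```

Under this model, the first command on slide 20 resolves to the MySQL connector while the second resolves to the generic JDBC connector, even though both point at the same database. That is the kind of surprise explicit selection is meant to remove.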
22. Sqoop 1: Using Connectors
• Require knowledge of database idiosyncrasies
  – e.g. Couchbase has no need for a table name, yet one is required, causing --table to get overloaded as a backfill or dump operation
  – e.g. --null-string representation is not supported by all connectors
• Functionality is limited to what the implicitly chosen connector supports
23. Sqoop 2: Using Connectors
• Users make explicit connector choice
  – Less error-prone, more predictable
• Users need not be aware of the functionality of all connectors
  – Couchbase users need not care that other connectors use tables
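Explicit selection can be pictured as a lookup by name, with each connector declaring only the inputs it needs. The sketch below is illustrative: `ConnectorRegistry`, `validate_job`, `UnknownConnectorError`, and the connector names and input keys are hypothetical and are not taken from the Sqoop 2 design.

```python
from typing import Dict, List


class UnknownConnectorError(KeyError):
    """Raised when a job names a connector that is not registered."""


class ConnectorRegistry:
    """Hypothetical server-side registry of connectors and their inputs."""

    def __init__(self) -> None:
        self._required: Dict[str, List[str]] = {}

    def register(self, name: str, required_inputs: List[str]) -> None:
        self._required[name] = list(required_inputs)

    def validate_job(self, connector: str, inputs: Dict[str, str]) -> None:
        """Fail fast if the connector is unknown or an input is missing."""
        if connector not in self._required:
            raise UnknownConnectorError(connector)
        missing = [key for key in self._required[connector] if key not in inputs]
        if missing:
            raise ValueError(f"missing inputs for {connector}: {missing}")
```

With a "generic-jdbc" connector registered as needing `url` and `table`, and a "couchbase" connector registered as needing only `url`, a Couchbase job validates without any table name, and a misspelled connector name is rejected instead of silently resolving to something else.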
24. Sqoop 2: Using Connectors
• Common functionality is available to all connectors
  – Connectors need not worry about common downstream functionality, such as transformation into various formats and integration with other systems
25. Ease of Extension
Sqoop 1 | Sqoop 2
Connector forced to follow JDBC model | Connector given free rein
Connectors must implement functionality | Connectors benefit from common framework of functionality
Connector selection is implicit | Connector selection is explicit
26. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
27. Security
Sqoop 1 | Sqoop 2
Support only for Hadoop security | Support for Hadoop security and role-based access control to external systems
High risk of abusing access to external systems | Reduced risk of abusing access to external systems
No resource management policy | Resource management policy
28. Sqoop 1: Security
• Inherit/propagate Kerberos principal for the jobs it launches
• Access to files on HDFS can be controlled via HDFS security
• Limited support (user/password) for secure access to external systems
29. Sqoop 2: Security
• Inherit/propagate Kerberos principal for the jobs it launches
• Access to files on HDFS can be controlled via HDFS security
• Support for secure access to external systems via role-based access to connection objects
  – Administrators create/edit/delete connections
  – Operators use connections
30. Sqoop 1: External System Access
• Every invocation requires necessary credentials to access external systems (e.g. relational database)
  – Workaround: create a user with limited access in lieu of giving out password
    • Does not scale
    • Permission granularity is hard to obtain
• Hard to prevent misuse once credentials are given
31. Sqoop 2: External System Access
• Connections are enabled as first-class objects
  – Connections encompass credentials
  – Connections are created once and then used many times for various import/export jobs
  – Connections are created by administrator and used by operator
    • Safeguard credential access from end users
• Connections can be restricted in scope based on operation (import/export)
  – Operators cannot abuse credentials
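One way to picture connections as first-class objects is a server-side store that only administrators can write to and that operators reference by name. The sketch below is a hypothetical model, not Sqoop 2 code: `Role`, `Operation`, `Connection`, `ConnectionStore`, and `submit_job` are invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet


class Role(Enum):
    ADMINISTRATOR = "administrator"
    OPERATOR = "operator"


class Operation(Enum):
    IMPORT = "import"
    EXPORT = "export"


@dataclass(frozen=True)
class Connection:
    name: str
    url: str
    username: str
    password: str = field(repr=False)  # keep the secret out of repr() and logs
    allowed: FrozenSet[Operation] = frozenset(Operation)


class ConnectionStore:
    """Hypothetical server-side store; credentials never leave it."""

    def __init__(self) -> None:
        self._connections: Dict[str, Connection] = {}

    def create(self, role: Role, connection: Connection) -> None:
        if role is not Role.ADMINISTRATOR:
            raise PermissionError("only administrators may create connections")
        self._connections[connection.name] = connection

    def submit_job(self, name: str, operation: Operation) -> str:
        """Operators name a connection; they never handle its password."""
        connection = self._connections[name]
        if operation not in connection.allowed:
            raise PermissionError(f"{name} does not allow {operation.value}")
        return f"{operation.value} via {connection.name} as {connection.username}"
```

An administrator might create a connection restricted to `Operation.IMPORT`. An operator can then run import jobs against it by name, but an export attempt is refused, and the password never appears in anything the operator sees.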
32. Sqoop 1: Resource Management
• No explicit resource management policy
  – Users specify the number of map jobs to run
  – Cannot throttle load on external systems
33. Sqoop 2: Resource Management
• Connections allow specification of resource management policy
  – Administrators can limit the total number of physical connections open at one time
  – Connections can also be disabled
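A per-connection cap on physical connections behaves much like a bounded semaphore. The following sketch assumes that model; `ThrottledConnection`, `max_physical`, and `ConnectionDisabledError` are hypothetical names, and nothing here reflects how Sqoop 2 actually enforces its policy.

```python
import threading


class ConnectionDisabledError(RuntimeError):
    """Raised when a job tries to use a connection an administrator disabled."""


class ThrottledConnection:
    """Hypothetical connection object carrying a resource management policy."""

    def __init__(self, name: str, max_physical: int) -> None:
        self.name = name
        self.enabled = True
        # One slot per physical connection the administrator allows.
        self._slots = threading.BoundedSemaphore(max_physical)

    def acquire(self) -> bool:
        """Try to open one physical connection; False means the cap is reached."""
        if not self.enabled:
            raise ConnectionDisabledError(self.name)
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Return a slot when a physical connection is closed."""
        self._slots.release()
```

The contrast with Sqoop 1 is that the limit belongs to the connection rather than to each job: however many map tasks individual users request, the external system never sees more than `max_physical` connections at once.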
34. Security
Sqoop 1 | Sqoop 2
Support only for Hadoop security | Support for Hadoop security and role-based access control to external systems
High risk of abusing access to external systems | Reduced risk of abusing access to external systems
No resource management policy | Resource management policy
40. Takeaway
Sqoop 2 Highlights:
  – Ease of Use: Sqoop as a Service
  – Ease of Extension: Connectors benefit from shared functionality
  – Security: Connections as first-class objects and role-based security
41. Current Status: work-in-progress
• Sqoop2 Development: http://issues.apache.org/jira/browse/SQOOP-365
• Sqoop2 Blog Post: http://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
• Sqoop2 Design: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2