\chapter{Background and Literature Review}

\epigraph{Books serve to show a man that those original thoughts of
  his aren't very new after all}{\textit{Abraham Lincoln}}

The idea of unifying the storage provided by multiple Internet file
storage providers and storing all the content in an encrypted form is
not new; researchers and programmers have devised different methods to
use multiple file storage providers' storage space. This chapter gives
an overview of three such efforts: the work done by Yeo et al. in
unifying the storage provided by Dropbox, Box, Google Drive and
SkyDrive on Android devices\cite{yeo} (Section \ref{2-yeo-sec});
SkyCDS, a content delivery service by Gonzalez et al., which uses a
publish/subscribe overlay paradigm and stores the content across
multiple cloud storage providers such that only part of the content
(in encrypted form) is stored on each file storage
provider\cite{skycds} (Section \ref{2-skycds-sec}); and
\verb+git-annex+, by Joey Hess\cite{person:joeyh}, which allows one to
version control and keep track of large files, with the option of
encrypting files that are stored in ``special remotes'', that is,
storage provided by Internet file storage providers (Section
\ref{2-gitannex-sec}).

\section{Multi Cloud Storage Prototype}\label{2-yeo-sec}

In their paper ``Leveraging client-side storage techniques for
enhanced use of multiple consumer cloud storage services on
resource-constrained mobile devices'', Yeo et al. present a prototype
Android application that unifies the storage provided by Dropbox, Box,
Google Drive and SkyDrive. The application allows the user to store
all their information in a single location on their phone. It uses
erasure coding\cite{weatherspoon} to split each file into
\verb`n + k` fragments and spreads the encrypted fragments across the
storage provided by the file storage providers. All basic file
operations -- Create, Read, Update, Delete (CRUD) -- are possible.
Information about the files stored in the unified location is kept in
a SQLite database. Unlike combox, which depends on each file storage
provider's client to sync file fragments/shards to the file storage
provider's server, the Android application developed by Yeo et
al. takes responsibility for syncing file fragments/shards to each
file storage provider itself and uses the OAuth
2.0\cite{protocol:oauth2} protocol for authorization.
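
As a rough illustration of the storage cost of such a scheme, assume
the common erasure-coding convention (an assumption made here, since
the roles of the fragments are not spelled out above) that $n$ of the
$n + k$ fragments carry the data, $k$ are redundant, and any $n$
fragments suffice to rebuild the file. For a file $F$ of size $|F|$,
each fragment then has size $|F|/n$ and the total amount stored is
\[
  (n + k)\,\frac{|F|}{n} \;=\; \left(1 + \frac{k}{n}\right)|F| ,
\]
that is, a storage overhead of $k/n$. With the hypothetical values
$n = 3$ and $k = 1$, which would place one fragment on each of the
four providers, the overhead is $1/3 \approx 33\%$ and the file
remains recoverable if any one provider is unavailable.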

For encrypting file fragments, they use AES-256; the encryption key is
derived from the user's password using Password-Based Key Derivation
Function 2 (PBKDF2)\cite{kaliski}. For erasure coding they use the
JigDFS library\cite{jigdfs}. The Android application is able to do
``progressive streaming'' of media files; this means that large media
files can be streamed in real-time from the file storage providers'
servers. This is an attractive feature on a ``resource constrained''
device where storage is expensive.
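
The key derivation step can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the code used by
Yeo et al.: the underlying hash (SHA-256), the iteration count, the
salt length and the function name \verb+derive_key+ are all chosen
here for the example. Only \verb+hashlib.pbkdf2_hmac+ from the Python
standard library is used.

\begin{verbatim}
import hashlib
import os

KEY_LEN = 32          # 32 bytes = 256 bits, the AES-256 key size
ITERATIONS = 200_000  # assumed work factor, not from the paper


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a 256-bit key with PBKDF2-HMAC-SHA256 (assumed hash)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt,
        ITERATIONS, dklen=KEY_LEN,
    )


if __name__ == "__main__":
    # A random salt; it must be stored to re-derive the same key.
    salt = os.urandom(16)
    key = derive_key("correct horse battery staple", salt)
    print(len(key))  # 32
\end{verbatim}

The same password and salt always give the same key, so only the salt
needs to be stored; the key itself can be re-derived whenever the user
enters the password.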

Yeo et al. also propose methods for data de-duplication, file
compression based on the type of the file, intelligent pre-fetching
and caching of file fragments, and ``automatic restoration in
exploiting file-versioning''. These features were not implemented in
the prototype Android application, although there is a possibility of
Yeo et al. implementing them in the future.

The importance of Yeo et al.'s work becomes apparent when we take into
consideration the research done by Yang et al., which found that 59\%
of the users who use a ``cloud storage service'' access the service
through a smart phone and that 42.2\% of users access it for
audio/video\cite{yang}. The research by Yang et al. suggests a trend
of users preferring small hand-held computers over laptops and
desktops.

\section{SkyCDS}\label{2-skycds-sec}

SkyCDS, by Gonzalez et al., is a content delivery system that splits
and spreads the content across multiple file storage
providers\cite{skycds}. According to Gonzalez et al., the main reason
for designing and developing SkyCDS was to prevent content providers
from getting locked into just one file storage provider and to
minimize loss when a file storage provider goes out of business or
when there is a temporary outage in the storage service provided by
the file storage provider.

In SkyCDS, the delivery of content to its subscribers is separated
into two distinct layers -- the Metadata Flow Layer and the Content
Flow Layer. The publisher of the content largely interacts with the
Metadata Flow Layer, which controls and keeps track of what content is
published; the subscriber also largely interacts with the Metadata
Flow Layer, to subscribe to content published in the content delivery
system. The Content Flow Layer is where the content is stored across
multiple file storage providers. The publisher is responsible for
publishing the content using the ``delivery workflow'' (part of the
Content Flow Layer) and the subscriber uses the ``retrieve workflow''
to get access to the subscribed content.

When content has to be dispersed to $k$ file storage providers, the
content is split into $n$ chunks, with $n > k$. This file splitting
seems to produce a redundancy overhead of 66.7\%\cite{skycds}. The
scheme looks very similar to erasure coding, but Gonzalez et al. do
not explicitly state that it is indeed ``erasure coding''. The
splitting of content is done by the ``delivery workflow'' engine,
which is invoked when the publisher triggers the action to publish the
respective content to subscribers.
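
The 66.7\% figure can be related to the chunk count with a short
calculation. Assume, purely for illustration (this parameter is not
given above), that any $m$ of the $n$ chunks suffice to rebuild the
content and that each chunk is $1/m$ of the original size. The
redundancy overhead is then
\[
  \frac{n \cdot \frac{1}{m} - 1}{1} \;=\; \frac{n - m}{m} ,
  \qquad\mbox{for example}\qquad
  \frac{5 - 3}{3} \;=\; \frac{2}{3} \;\approx\; 66.7\% .
\]
The values $n = 5$ and $m = 3$ are only one choice that is consistent
with the reported number; they should not be read as the configuration
used by Gonzalez et al.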

To evaluate the effectiveness of SkyCDS, Gonzalez et al. state that
they have done a case study using data obtained from the European
Space Astronomy Center (ESAC) for the Soil Moisture and Ocean Salinity
(SMOS) mission. In this study, a group of organizations on two
different continents used SkyCDS to share satellite images with each
other. According to Gonzalez et al., this study showed SkyCDS to be a
viable option for content delivery with respect to performance, cost
of file storage space and reliability.

\section{git-annex}\label{2-gitannex-sec}

\verb+git-annex+ allows one to version control large files that are
not usually feasible to version control under
\verb+git+\cite{program:git}. \verb+git-annex+ checks the names and
other meta-data about the files into git and stores the actual content
under the \verb+.git/annex+ directory. When a file is added to
\verb+git-annex+, a symlink to the content is created in place of the
file and the content of the file itself is stored under the
\verb+.git/annex+ directory.

For instance, say there is a file called
\verb+deb-nicholson-80s.medium.webm+ that was downloaded from the
Internet to the \verb+git-annex+ directory:

\begin{verbatim}
↳ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

   deb-nicholson-80s.medium.webm

↳ ls -l
total 105708
...
-rw-r--r-- 1 rsd rsd 108196923 May  5  2015 deb-nicholson-80s.medium.webm
...
\end{verbatim}

When this file is added to \verb+git-annex+ with \verb+git annex add+,
the file turns into a symlink to a file under the \verb+.git/annex+
directory:

{\small
\begin{verbatim}
↳ git annex add deb-nicholson-80s.medium.webm
add deb-nicholson-80s.medium.webm ok
(recording state in git...)

↳ ls -l
...
lrwxrwxrwx 1 rsd rsd   207 May  5  2015 deb-nicholson-80s.medium.webm -> ../.git/an
nex/objects/3j/vG/SHA256E-s108196923--7de9484ee96908268e21b451eb9805552c32b44da08e7
0ee861332c87352944f.webm/SHA256E-s108196923--7de9484ee96908268e21b451eb9805552c32b4
4da08e70ee861332c87352944f.webm

↳ git commit -m "Added video/deb-nicholson-80s.medium.webm"
[master efa1775] Added video/deb-nicholson-80s.medium.webm
 1 file changed, 1 insertion(+)
 create mode 120000 video/deb-nicholson-80s.medium.webm
\end{verbatim}
}
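
The name of the symlink target in the listing above has the form
\verb+SHA256E-s<size>--<sha256 of content><extension>+. A minimal
Python sketch of how such a name can be computed for a file is given
below. It is an illustration, not \verb+git-annex+'s own code: the
function name \verb+annex_key+ is hypothetical, and the extension
handling is simplified to keep only the last suffix of the file name.

\begin{verbatim}
import hashlib
import os


def annex_key(path: str) -> str:
    """Build a name of the form SHA256E-s<size>--<sha256><ext>."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Hash in 1 MiB blocks so large files need not fit in memory.
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
            size += len(block)
    # Simplification: keep only the last suffix, e.g. ".webm".
    ext = os.path.splitext(path)[1]
    return f"SHA256E-s{size}--{digest.hexdigest()}{ext}"
\end{verbatim}

Because the name depends only on the size, the content and the
extension of the file, two files with identical content and extension
map to the same name under this scheme.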

Now the file \verb+deb-nicholson-80s.medium.webm+ is checked into
\verb+git-annex+, and \verb+git annex sync+ can be run to sync the
repository with other \verb+git-annex+ repositories. It must be noted
here that when the repository is synced, the file content itself is
not transferred to the other \verb+git-annex+ repositories; only the
file's name and its meta-data, which is stored in a separate git
branch called \verb+git-annex+, are
transferred\cite{documentation:git-annex-hworks}. In order to create a
copy of a given file in another git annex repository,
\verb+git annex get /path/to/filename.ext+ has to be run in that
repository.

\verb+git-annex+ has a feature called ``special
remotes''\cite{documentation:git-annex-sremotes}, which allows one to
push files checked into \verb+git-annex+ to storage provided by file
storage providers. At the time of writing this report,
\verb+git-annex+ supports pushing data to the following file storage
services:

{\scriptsize
\begin{itemize}
\item Amazon S3
\item Amazon Glacier
\item Internet Archive via S3
\item Box.com
\item Google Drive
\item Google Cloud Storage
\item Mega.co.nz
\item SkyDrive
\item OwnCloud
\item Flickr
\item IMAP
\item Usenet
\item chef-vault
\item hubiC
\item pCloud
\item ipfs
\item Ceph
\item Backblaze's B2
\end{itemize}
}

All data pushed to a file storage provider's servers can optionally be
encrypted using one's GPG key. For instance, to encrypt data that is
pushed to the Amazon S3 special remote, the following command is
used\cite{docs:git-annex-as3}:

\begin{verbatim}
$ git annex initremote cloud type=S3 keyid=2512E3C7
initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok
$ git annex describe cloud "at Amazon's US datacenter"
describe cloud ok
\end{verbatim}

where \verb+2512E3C7+ is the id of the GPG key to use for encrypting
data pushed to the Amazon S3 special remote. It is also possible to
store each file that is pushed to the remotes as a set of chunks of
size \verb+N+ by passing \verb+chunk=N+ when the remote is
initialized:

\begin{verbatim}
$ git annex initremote cloud type=S3 chunk=1MiB keyid=2512E3C7
initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok
$ git annex describe cloud "at Amazon's US datacenter"
describe cloud ok
\end{verbatim}

With this configuration, each file that is pushed to the Amazon S3
special remote is divided into 1MiB chunks, each chunk is encrypted
using the GPG key \verb+2512E3C7+, and the encrypted chunks are
finally pushed to the Amazon S3 remote. It must be noted here that,
unlike the Multi Cloud Storage Prototype, SkyCDS or combox, when file
chunking is used in \verb+git-annex+ all the chunks go to the same
location -- in this case, the Amazon S3 remote.
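
The fixed-size chunking described above can be pictured with the
following Python sketch. It only illustrates the splitting step: the
function name \verb+split_chunks+ is hypothetical, encryption is left
out, and nothing here is taken from \verb+git-annex+'s implementation.

\begin{verbatim}
from typing import Iterator

MIB = 1024 * 1024


def split_chunks(data: bytes, size: int = MIB) -> Iterator[bytes]:
    """Yield consecutive chunks of at most `size` bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(data), size):
        yield data[start:start + size]


if __name__ == "__main__":
    payload = bytes(2 * MIB + 5)  # 2 MiB + 5 zero bytes
    print([len(c) for c in split_chunks(payload)])
    # [1048576, 1048576, 5]
\end{verbatim}

Every chunk except possibly the last has exactly the requested size,
and concatenating the chunks in order gives back the original content.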