combox-paper

notes and other things concerning combox
git clone git://git.ricketyspace.net/combox-paper.git

commit 2c136728999d7451d8eef2f202a08ec7bc524136
parent f20eb79289341ed649345a30aacd7cd07ba2e135
Author: Siddharth Ravikumar <sravik@bgsu.edu>
Date:   Fri, 26 Feb 2016 08:36:26 -0500

Moved around chapters.

Chapter 3 -> Chapter 2
Chapter 4 -> Chapter 3
Chapter 5 -> Chapter 4

Diffstat:
report/chapters/2-lit-r.tex | 234+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
report/chapters/3-arch-d.tex | 504+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
report/chapters/3-lit-r.tex | 234-------------------------------------------------------------------------------
report/chapters/4-arch-d.tex | 504-------------------------------------------------------------------------------
report/chapters/4-testing.tex | 669+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
report/chapters/5-testing.tex | 669-------------------------------------------------------------------------------
6 files changed, 1407 insertions(+), 1407 deletions(-)

diff --git a/report/chapters/2-lit-r.tex b/report/chapters/2-lit-r.tex @@ -0,0 +1,234 @@ +\chapter{Background and Literature Review} + +\epigraph{Books serve to show a man that those original thoughts of + his aren't very new after all}{\textit{Abraham Lincoln}} + +The idea of unifying the storage provided by multiple Internet file +storage providers and storing all the content in an encrypted form is +not new; computer researchers, scientists and programmers have devised +different methods to use multiple file storage providers' storage +space. This chapter gives an overview of the work done by Yeo et +al. in unifying the storage provided by Dropbox, Box, Google Drive and +SkyDrive on Android devices\cite{yeo} (Section \ref{3-yeo-sec}); +SkyCDS, a content delivery service, by Gonzalez et al., which uses the +publish/subscribe overlay paradigm and stores the content across +multiple ``cloud'' storage providers such that only part of the +content (in encrypted form) is stored on each ``cloud'' storage +provider\cite{skycds} (Section \ref{3-skycds-sec}); and lastly, +\verb+git-annex+, by Joey Hess\cite{person:joeyh}, which allows one to +version control and keep track of large files, with the possibility of +encrypting files that are stored in ``special remotes'' -- storage +provided by Internet file storage providers (Section +\ref{3-gitannex-sec}). + +\section{Multi Cloud Storage Prototype}\label{3-yeo-sec} + +In their paper ``Leveraging client-side storage techniques for +enhanced use of multiple consumer cloud storage services on +resource-constrained mobile devices'', Yeo et al. present their Android +mobile application, a prototype, which unifies storage provided by +Dropbox, Box, Google Drive and SkyDrive. 
The application allows the +user to store all their information in a single location on their +phone and the application uses erasure coding\cite{weatherspoon} to +split each file into \verb`n + k` fragments and spreads the encrypted +fragments across storage provided by the file storage providers. All +basic file operations -- Create, Rename, Update, Delete (CRUD) -- are +possible. Information about the file stored in a unified location is +stored in a SQLite database. Unlike combox, which depends on the file +storage provider's client to sync file fragments/shards to the file +storage provider's server, the Android application developed by Yeo et +al. takes responsibility for syncing file fragments/shards to each +file storage provider and uses the OAuth 2.0\cite{protocal:oauth2} +protocol for authorization. + +For encrypting file fragments, they use AES-256; the key for +encrypting is derived from the user's password by using Password-Based +Key Derivation Function (PBKDF2)\cite{kaliski}. For erasure coding +they use the JigDFS library\cite{jigdfs}. The Android application is +able to do ``progressive streaming'' of media files; this means that +large media files can be streamed in real-time from the file +storage providers' servers; this is an attractive feature in a +``resource constrained'' device where storage is expensive. + +Yeo et al. propose methods for achieving data de-duplication, file +fragment/shard compression based on the type of the file, intelligent +pre-fetching and caching for file fragments and ``automatic +restoration in exploiting file-versioning''; these features were not +implemented in the prototype Android application and there is a +possibility of Yeo et al. implementing these features in the future. + +It becomes clear that Yeo et al.'s 
work is of immense importance when we +take into consideration the research done by Yang et al., which found +that 59\% of the users who use a ``cloud storage service'' access the +service through a smart phone and 42.2\% of users access +audio/video\cite{yang}. The research by Yang et al. definitely +suggests a trend of users' preference for small hand-held computers +over laptops and desktops. + +\section{SkyCDS}\label{3-skycds-sec} + +SkyCDS, by Gonzalez et al., is a content delivery system that splits +and spreads the content across multiple ``cloud'' storage +providers\cite{skycds}. According to Gonzalez et al., the main reason +for designing and developing SkyCDS was to prevent content providers +from getting locked into just one ``cloud'' storage provider and to +minimize loss when a ``cloud'' storage provider goes out of business +or if there is a temporary outage in the storage service provided by the +``cloud'' storage provider. + +In SkyCDS the content delivery to subscribers of the content is +segregated into two distinct layers -- the Metadata Flow Layer and the +Content Flow Layer. The publisher of the content largely interacts +with the Metadata Flow Layer that controls and keeps track of what +content is published and the subscriber also largely interacts with +the Metadata Flow Layer to subscribe to content published in the +content delivery system. The Content Flow Layer is where the content +is stored across multiple ``cloud'' storage providers. The publisher +is responsible for publishing the content using the ``delivery +workflow'' (part of the Content Flow Layer) and the subscriber uses +the ``retrieve workflow'' to get access to the subscribed content. + +When content has to be dispersed to $k$ ``cloud'' storage providers, +the content is split into $n$ chunks, $n > k$; this file splitting +seems to produce 66.7\% of redundancy overhead\cite{skycds}. This file +splitting scheme looks very similar to erasure coding, but Gonzalez et +al. 
don't explicitly state that the content splitting scheme is indeed +``erasure coding''. The splitting of content is done by the ``delivery +workflow'' engine which is invoked when the publisher triggers the +action to publish the respective content to subscribers. + +To evaluate the effectiveness of SkyCDS, Gonzalez et al. state that +they've done a case study using the data (content) obtained from +European Space Astronomy Center (ESAC) for the Soil Moisture Ocean +Salinity mission. In this study, a group of organizations, in two different +continents, used SkyCDS to share satellite images with each +other. According to Gonzalez et al. this study attested SkyCDS as a +viable option for content delivery with respect to performance, +cost of ``cloud'' storage space and reliability. + +\section{git-annex}\label{3-gitannex-sec} + +\verb+git-annex+ allows one to version control large files that are +not usually feasible to version control under +\verb+git+\cite{program:git}. \verb+git-annex+ checks in the names +and other meta-data about the files in git and stores the actual +content under the \verb+.git/annex+ directory. When a file is added to +\verb+git-annex+, a symlink of the file is created in place of the file +and the content of the file itself is stored under the +\verb+.git/annex+ directory. + +For instance, say a file called +\verb+deb-nicholson-80s.medium.webm+ was downloaded from the Internet +to the \verb+git-annex+ directory: + +\begin{verbatim} +↳ git status +On branch master +Untracked files: + (use "git add <file>..." to include in what will be committed) + + deb-nicholson-80s.medium.webm + +↳ ls -l +total 105708 +... +-rw-r--r-- 1 rsd rsd 108196923 May 5 2015 deb-nicholson-80s.medium.webm +... 
+\end{verbatim} + +When this file is added to \verb+git-annex+ with \verb+git annex add+, +the file turns into a symlink to a file under the \verb+.git/annex+ +directory: + +{\small +\begin{verbatim} +↳ git annex add deb-nicholson-80s.medium.webm +add deb-nicholson-80s.medium.webm ok +(recording state in git...) + +↳ ls -l +... +lrwxrwxrwx 1 rsd rsd 207 May 5 2015 deb-nicholson-80s.medium.webm -> ../.git/an +nex/objects/3j/vG/SHA256E-s108196923--7de9484ee96908268e21b451eb9805552c32b44da08e7 +0ee861332c87352944f.webm/SHA256E-s108196923--7de9484ee96908268e21b451eb9805552c32b4 +4da08e70ee861332c87352944f.webm + +↳ git commit -m "Added video/deb-nicholson-80s.medium.webm" +[master efa1775] Added video/deb-nicholson-80s.medium.webm + 1 file changed, 1 insertion(+) + create mode 120000 video/deb-nicholson-80s.medium.webm +\end{verbatim} +} + +Now, the file \verb+deb-nicholson-80s.medium.webm+ is checked into +\verb+git-annex+ and we can now do a \verb+git annex sync+ to sync the +repository to other \verb+git-annex+ repositories. It must be noted +here that when the repository is synced, the file content itself +is not transferred to the other \verb+git-annex+ repositories; only +the file's name and its meta-data that is stored in a separate git +branch called \verb+git-annex+ are +transferred\cite{documentation:git-annex-hworks}. In order to create a +copy of a given file in another \verb+git-annex+ repository, +\verb+git annex get /path/to/filename.ext+ has to be run. + +\verb+git-annex+ has a feature called ``special +remotes''\cite{documentation:git-annex-sremotes}, that allows one to +push/copy data checked into \verb+git-annex+ to storage provided by +``cloud'' storage providers. 
At the time of writing this report, +\verb+git-annex+ supports pushing data to the following file storage +services: + +{\scriptsize +\begin{itemize} +\item Amazon S3 +\item Amazon Glacier +\item Internet Archive via S3 +\item Box.com +\item Google Drive +\item Google Cloud Storage +\item Mega.co.nz +\item SkyDrive +\item OwnCloud +\item Flickr +\item IMAP +\item Usenet +\item chef-vault +\item hubiC +\item pCloud +\item ipfs +\item Ceph +\item Backblaze's B2 +\end{itemize} +} + +All data pushed to file storage providers' servers can be optionally +encrypted using one's GPG key. For instance, to encrypt data that is +pushed to the Amazon S3 special remote, the following command is +used\cite{docs:git-annex-as3}: + +\begin{verbatim} +$ git annex initremote cloud type=S3 keyid=2512E3C7 +initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok +$ git annex describe cloud "at Amazon's US datacenter" +describe cloud ok +\end{verbatim} + +where \verb+2512E3C7+ is the id of the GPG key to use for encrypting +data pushed to the Amazon S3 special remote. It is also possible to +store each file that is pushed to the remotes as a set of chunks of +size \verb+N+; to do that we do: + +\begin{verbatim} +$ git annex initremote cloud type=S3 chunk=1MiB keyid=2512E3C7 +initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok +$ git annex describe cloud "at Amazon's US datacenter" +describe cloud ok +\end{verbatim} + +With that, each file that has to be pushed to the Amazon S3 special +remote is divided into 1MiB chunks, each chunk is encrypted using the +GPG key \verb+2512E3C7+ and the encrypted chunks are finally pushed to +the Amazon S3 remote. It must be noted here that unlike the Multi +Cloud Storage Prototype or SkyCDS or combox, in \verb+git-annex+ when +we are using file chunking all the chunks go to the same location -- +in this case, the Amazon S3 remote. 
diff --git a/report/chapters/3-arch-d.tex b/report/chapters/3-arch-d.tex @@ -0,0 +1,504 @@ +\chapter{Architecture and Design} + +\epigraph{In general, when modeling phenomena in science and + engineering, we begin with simplified, incomplete models. As we + examine things in greater detail, these simple models become + inadequate and must be replaced by more refined + models.}{\textit{Structure and Interpretation of Computer Programs, + Section 1.1.5}\cite{sicp}} + +\section{Structure of combox} + +combox consists of two main components -- the combox directory and the +node directories. The combox directory is the place where the user +stores all her files; the node directories are the directories across +which encrypted shards of the files (in the combox directory) are +scattered. A node directory is the file storage provider's +directory; for instance, the Dropbox directory and the Google Drive +directory are node directories. + +When a file \verb+file.ext+ is created in the combox directory, combox +splits \verb+file.ext+ into \verb+N+ shards, where \verb+N+ is the +number of node directories; if there are two node directories (Dropbox +directory and Google Drive directory), then 2 shards are created. Each +shard of the file is then encrypted and the encrypted shards are +spread evenly across the node directories; if there are two node +directories -- Dropbox directory and Google Drive directory -- combox +will create two encrypted shards of file \verb+file.ext+ -- +\verb+file.ext.shard0+, \verb+file.ext.shard1+ -- and place one +encrypted shard under the Dropbox directory and the other encrypted +shard under the Google Drive directory. Now, the Dropbox client and +the Google Drive client will sync the respective shards that were placed under +their directories to their servers. 
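The splitting step described above can be illustrated with a minimal Python sketch. It is not combox's actual implementation (which lives in the \verb+combox.file+ and \verb+combox.crypto+ modules); the helper names \verb+split_into_shards+, \verb+shard_names+ and \verb+glue_shards+ are hypothetical, the shards are assumed to be contiguous byte ranges of near-equal size, and encryption is left out entirely.

```python
import math


def split_into_shards(data, n):
    """Split bytes into n contiguous shards of near-equal size.

    Hypothetical helper for illustration only; n stands for the number
    of node directories. Trailing shards may be shorter (or empty).
    """
    if n < 1:
        raise ValueError("need at least one node directory")
    size = math.ceil(len(data) / n) if data else 0
    return [data[i * size:(i + 1) * size] for i in range(n)]


def shard_names(filename, n):
    """Name shards the way the text does: file.ext.shard0, file.ext.shard1, ..."""
    return ["{}.shard{}".format(filename, i) for i in range(n)]


def glue_shards(shards):
    """Reconstruct the original bytes from shards given in order."""
    return b"".join(shards)
```

With two node directories, \verb+split_into_shards(content, 2)+ yields the two pieces that would be encrypted and stored as \verb+file.ext.shard0+ and \verb+file.ext.shard1+; gluing them in order gives back the original content.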
+ +\begin{figure}[h] +\includegraphics[scale=0.6]{4-combox-structure} +\caption{High level view of combox on two computers.} +\label{fig:4-combox-structure} +\end{figure} + +Now, we can move to another computer and start combox on it. First, +the node clients (Dropbox client and the Google Drive client) will +sync the new encrypted shards to their respective directories. Once +the encrypted shards are synced to the node directories, combox will +pick the encrypted shards -- \verb+file.ext.shard0+, +\verb+file.ext.shard1+ -- decrypt them, reconstruct +\verb+file.ext+ and place it in the respective location under the combox +directory; figure \ref{fig:4-combox-structure} illustrates this. The +process is similar for file modification, deletion and rename/move. + +\subsection{combox configuration}\label{sec:4-combox-config} + +combox configuration triggers automatically when combox finds that it +is not configured on this computer. The combox configuration sets up +the combox directory; asks the user to point to the location of the +node directories; reads the key (passphrase) to be used to encrypt +file shards that are spread across the node directories. The combox +configuration is written to +\verb+$HOME/.combox/config.yaml+; this YAML configuration file can be +manually edited by the user. + +The \verb+config_cb+ function in the \verb+combox.config+ module is +responsible for carrying out the combox configuration. Prior to +version \verb+0.2.0+, the combox configuration was purely done through +the CLI; from \verb+0.2.0+ onwards, by default, the combox +configuration is done through a graphical interface; it is still possible +to configure combox through the CLI with the \verb+--cli+ switch. 
+ +A demo of combox configuration using the graphical interface on +GNU/Linux can be viewed at +\url{https://ricketyspace.net/combox/combox-config-gui-glued-gnu.webm}; +the same demo of combox configuration using the graphical interface on +OS X can be viewed at +\url{https://ricketyspace.net/combox/combox-config-gui-glued-osx.webm}. + +\subsection{combox directory monitor}\label{sec:4-combox-cdirm} + +The combox directory monitor is an instance of +\verb+combox.events.ComboxDirMonitor+ monitoring the combox directory +for changes. When changes are made to the combox directory, the combox +directory monitor is responsible for correctly detecting the type of +change and doing the right thing at that instance of time. + +When a file is created in the combox directory, the combox directory +monitor will take that file, split it into \verb+N+ (equal to the +number of node directories) shards, encrypt the shards, spread the +encrypted shards to the node directories, and finally store the hash +of the file in the local combox database. + +When a file is modified in the combox directory, the combox directory +monitor will take that modified file, split it into \verb+N+ (equal to +the number of node directories) shards, encrypt the shards, spread the +encrypted shards to the node directories, and finally update the hash +of the file in the local combox database. + +When a file is deleted in the combox directory, the combox directory +monitor will remove the encrypted shards of the file in the node +directories and get rid of the file's hash from the local combox +database. + +When a file is moved/renamed in the combox directory, the combox +directory monitor will move/rename the encrypted shards in the node directories, +remove the file's old entry from the local combox database and store the hash of +the file under its new name. + +\subsection{Node directory monitor}\label{sec:4-combox-nodirm} + +The node directory monitor is an instance of +\verb+combox.events.NodeDirMonitor+ monitoring a node directory. 
When +changes are made to the node directory, the node directory monitor is +responsible for correctly detecting the type of change and doing the +right thing at that instance of time. Each node directory has a +dedicated node directory monitor; if there are 2 node directories, +then combox will instantiate 2 node directory monitors. + +When an encrypted shard is created in the node directory due to a file +created on another computer, the node directory monitor first checks if the +respective file's encrypted shard(s) has/have arrived in other node +directory/directories. If all encrypted shards have arrived, then the +node directory monitor takes all the encrypted shards, decrypts them, +reconstructs the file and creates the file in the combox directory of +this computer and stores the hash of the newly created file in the +local combox database. If not all the encrypted shards have +arrived, then the node directory monitor does not do anything. It must be +observed here that the node directory monitor of the last node +directory which gets the encrypted shard will be the one to perform +the file reconstruction and creation. + +When an encrypted shard is modified in the node directory due to a +file modified on another computer, the node directory monitor first checks if +the respective file's modified encrypted shard(s) has/have arrived in +other node directory/directories. If all modified encrypted shards +have arrived, then the node directory monitor takes all the modified encrypted +shards, decrypts them, reconstructs the file and puts the modified +version of the file in the combox directory of this computer and +updates the file's hash in the local combox database. If not all the +modified encrypted shards have arrived, then the node directory monitor +does not do anything. 
It must be observed here that the node directory +monitor of the last node directory which gets the modified encrypted +shard will be the one to perform the file reconstruction and will place +the modified file in the combox directory. + +When an encrypted shard is deleted in the node directory due to a file +deleted on another computer, the node directory monitor first checks if the +respective file's encrypted shard(s) has/have been deleted in other +node directory/directories. If all encrypted shards have been deleted +from the node directories, then the node directory monitor deletes the file in +the combox directory of this computer and removes its information from +the local combox database. If not all encrypted shards have been +deleted, then the node directory monitor does not do anything. It must be +observed here that the node directory monitor of the last node +directory in which the encrypted shard is deleted will be the one to +delete the file from the combox directory. + +When an encrypted shard is moved/renamed in the node directory due to +a file moved/renamed on another computer, the node directory monitor first +checks if the respective file's moved/renamed encrypted shard(s) +has/have arrived in other node directory/directories. If all +moved/renamed encrypted shards have arrived, then the node directory monitor +takes all the moved/renamed encrypted shards, decrypts them, +reconstructs the moved/renamed file and puts the moved/renamed file +in the combox directory of this computer and stores the hash under the +file's new name in the local combox database. If not all the +moved/renamed encrypted shards have arrived, then the node +directory monitor does not do anything. It must be observed here that the node +directory monitor of the last node directory which gets the +moved/renamed encrypted shard will be the one to perform the file +reconstruction and will place the moved/renamed file in the combox +directory. 
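The ``last monitor to see a shard acts'' rule that runs through the four cases above can be sketched in Python as follows. This is only an illustration under assumptions: the class \verb+ShardArrivalTracker+ and its method \verb+shard_arrived+ are hypothetical names, not part of \verb+combox.events+, and the sketch assumes exactly one shard event per node directory for each remote change.

```python
import threading


class ShardArrivalTracker(object):
    """Counts shard events per file; hypothetical, for illustration only."""

    def __init__(self, num_node_dirs):
        self.num_node_dirs = num_node_dirs
        self._seen = {}                  # file path -> shard events seen so far
        self._lock = threading.Lock()    # shared by all node directory monitors

    def shard_arrived(self, file_path):
        """Record one shard event for file_path.

        Returns True only for the caller that observes the last
        outstanding shard; that caller should reconstruct (or delete,
        or move) the file in the combox directory.
        """
        with self._lock:
            count = self._seen.get(file_path, 0) + 1
            if count == self.num_node_dirs:
                # All shards accounted for: reset and let this caller act.
                self._seen.pop(file_path, None)
                return True
            self._seen[file_path] = count
            return False
```

With two node directories, the first monitor to call \verb+shard_arrived+ for a file gets \verb+False+ and does nothing; the second gets \verb+True+ and performs the reconstruction.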
+ +\subsection{Database structure}\label{sec:4-combox-db} + +To keep it simple, stupid, I decided to maintain bare minimum +information about files, stored in the combox directory, and depend on +file system events to do the right thing when changes take place in +the combox directory. + +The only information that is stored in the database about a file in +the combox directory is its SHA-512 hash; the SHA-512 hash of a file +is enough information to detect changes in the file. In the database, there +are also four dictionaries -- \verb+file_moved+, \verb+file_deleted+, +\verb+file_created+, \verb+file_modified+ -- which track the number +of shards of a file that were moved/deleted/created/modified due to the +respective file being moved/deleted/created/modified on another +computer; these four dictionaries are primarily used by the +\verb+NodeDirMonitor+ to detect remote file +movement/deletion/creation/modification and to trigger file +reconstruction from shards at the right time. + +The database is a JSON file on the disk, stored by default at +\verb+$HOME/.combox/silo.db+. The +\verb+combox.silo.ComboxSilo+\cite{combox-src:silo.ComboxSilo} is the +sole interface to read from and write to the database. The database is +primarily accessed and modified by the combox directory monitor +(\verb+ComboxDirMonitor+) and the node directory monitor +(\verb+NodeDirMonitor+) through a shared Lock\cite{py:threading.Lock} +that ensures that only one entity\footnote{An entity can be the combox + directory monitor or one of the node directory monitors} can +access/modify the database at a time. 
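A minimal Python sketch of a lock-guarded JSON store holding SHA-512 hashes is given here for illustration. It assumes plain \verb+json+ from the standard library rather than the library combox actually uses, and the names \verb+TinySilo+ and \verb+sha512_of+ are hypothetical; it is not the \verb+ComboxSilo+ class.

```python
import hashlib
import json
import os
import threading


def sha512_of(path):
    """Return the hex SHA-512 digest of the file at path."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


class TinySilo(object):
    """Hypothetical key-value store backed by a JSON file on disk."""

    def __init__(self, db_path, lock):
        self.db_path = db_path
        self.lock = lock  # one threading.Lock shared by every monitor

    def _load(self):
        if not os.path.exists(self.db_path):
            return {}
        with open(self.db_path) as f:
            return json.load(f)

    def update(self, file_path):
        """Store (or refresh) the hash of file_path under the shared lock."""
        with self.lock:
            db = self._load()
            db[file_path] = sha512_of(file_path)
            with open(self.db_path, "w") as f:
                json.dump(db, f)

    def stale(self, file_path):
        """True if file_path is unknown or its content hash has changed."""
        with self.lock:
            return self._load().get(file_path) != sha512_of(file_path)
```

Because every read and write happens inside \verb+with self.lock+, two monitors sharing the same lock object cannot interleave a load and a dump on the same file.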
+ +Below is an illustration of the structure of the combox database: + +\begin{verbatim} +{ + "/home/rsd/combox/ipsum.txt": "e3206df2bb2b3091103ab9d...", + "/home/rsd/combox/tk-shot-osx.png": "7fcf1b44c15dd95e0...", + "/home/rsd/combox/thgttg-21st.png": "0040eedfc3eeab546...", + "/home/rsd/combox/lorem.txt": "5851dd7a4870ff165facb71...", + "/home/rsd/combox/the-red-star.jpg": "4b818126d882e552...", + "file_moved": {}, + "file_deleted": {}, + "file_created": {}, + "file_modified": {} +} +\end{verbatim} + +The \verb+combox.silo.ComboxSilo+, which is the sole interface to read +from and write to the database, uses the pickleDB +library\cite{pylib:pickledb}. pickleDB is a very basic key-value +store which allows one to store information in the JSON format; had I +not found this library, or had it never been written by +Harrison Erd, I would have written something very similar to this +library as part of combox to realize the basic key-value storage that +is needed to track the hashes of the files stored in the combox +directory. + +It must be noted that the combox database stored on each computer is +independent and does not communicate or make transactions with the +combox databases located on other computers. + +\section{combox modules overview} + +combox is divided into modules that contain functions and/or classes. 
As +of \verb+2016-02-04+ combox is a considerably small program: + +\begin{verbatim} +$ wc -l combox/*.py + 144 combox/cbox.py + 178 combox/config.py + 241 combox/crypto.py + 891 combox/events.py + 541 combox/file.py + 454 combox/gui.py + 0 combox/__init__.py + 71 combox/log.py + 278 combox/silo.py + 29 combox/_version.py + 2827 total +\end{verbatim} + +This section gives an overview of each of the combox modules with +extreme brevity: + +\begin{description} +\item[combox.cbox] This module contains the \verb+run_cb+ function that runs + combox; it creates an instance of \verb+threading.Lock+ for database + access and a shared \verb+threading.Lock+ for the + \verb+combox.events.ComboxDirMonitor+ and + \verb+combox.events.NodeDirMonitor+; it initializes an instance of + \verb+combox.events.ComboxDirMonitor+ that monitors the combox + directory and an instance of \verb+combox.events.NodeDirMonitor+ for + each node directory for monitoring the node directories. This + module also houses the \verb+main+ function that parses commandline + arguments, starts combox configuration if needed or loads the combox + configuration file to start running combox. +\item[combox.config] Accommodates two important functions -- + \verb+config_cb+ and \verb+get_nodedirs+. The \verb+config_cb+ is + the combox configuration function that allows the user to configure + combox; this function was designed in such a way that it can be + used for both CLI and GUI methods of configuring + combox. The \verb+get_nodedirs+ function returns, as a list, the + paths of the node directories; this function is used in numerous + places in other combox modules. 
+\item[combox.crypto] This has functions for encrypting and decrypting + data; encrypting and decrypting shards (\verb+encrypt_shards+ and + \verb+decrypt_shards+); a function for splitting a file into shards, + encrypting those shards and spreading them across node directories + (\verb+split_and_encrypt+); a function for decrypting the shards + from the node directories, reconstructing the file from the + decrypted shards and putting the file back in the combox directory + (\verb+decrypt_and_glue+). Functions \verb+split_and_encrypt+ and + \verb+decrypt_and_glue+ are the two functions that are + extensively used by the \verb+combox.events+ module; all other + functions in this module are pretty much helper functions for the + \verb+split_and_encrypt+ and \verb+decrypt_and_glue+ functions and + are not used by other modules. +\item[combox.events] This module took the most time to write and test + and it is the most complex module in combox at the time of writing + this report. It contains just two classes -- \verb+ComboxDirMonitor+ + and \verb+NodeDirMonitor+. The \verb+ComboxDirMonitor+ inherits the + \verb+watchdog.events.LoggingEventHandler+ and is responsible for + monitoring for changes in the combox directory and doing the right + thing when a change happens in the combox directory. The + \verb+NodeDirMonitor+ also inherits + \verb+watchdog.events.LoggingEventHandler+ and is similarly responsible + for monitoring a node directory and doing the right thing when a + change happens in the node directory; subjectively, + \verb+NodeDirMonitor+ is slightly more complex than the + \verb+ComboxDirMonitor+. +\item[combox.file] This is the second largest module in combox. It + contains utility functions for reading, writing, moving + files/directories, hashing files, splitting a file into shards, gluing + shards into a file, manipulating directories inside combox and node + directories. 
+\item[combox.gui] Contains the \verb+ComboxConfigDialog+ class; it is + the graphical interface for configuring combox. The class uses the + Tkinter library\cite{pylib:tkinter} for spawning graphical + elements. Other graphical libraries, including PyQt\cite{pylib:qt}, + were considered; Tkinter was chosen over the others because it works on + all Unix systems and Microsoft's Windows and it is part of the core + python (version 3). +\item[combox.log] All the messages to \verb+stdout+ and \verb+stderr+ + are sent through the \verb+log_i+ and \verb+log_e+ + functions defined in this module. +\item[combox.silo] Contains the \verb+ComboxSilo+ class which is the + canonical interface for combox for managing information about the + files in the combox directory. Internally, the \verb+ComboxSilo+ + class uses the pickleDB library\cite{pylib:pickledb}. +\item[combox.\_version] This is a \emph{private} module that contains + variables that contain the value of the present version and release + of combox. The \verb+get_version+ function in this module returns + the full version number; this function is used by \verb+setup.py+. +\end{description} + +\section{Language choice} + +Back in October of 2014, I was learning to write in Python and when I +had to start working on combox, I chose to write combox in Python. In +my first commit to the combox repository, I had this to say about +Python: + +\begin{verbatim} +commit 2def977472b2e77ee88c9177f2d03f12b0263eb0 +Author: rsiddharth <rsiddharth@ninthfloor.org> +Date: Wed Oct 29 23:24:58 2014 -0400 + + Initial commit: File splitter & File gluer done. + + ... + + I like to write python FWIW. But after reading a dialect of Lisp when + I come back to python, it does not look very beautiful. I guess I'm + pretty convinced that there is no language that can ape the beauty of + Lisp. +\end{verbatim} + +If I were to write that commit message today (\verb+2016-02-04+), I +would've phrased my reflections about Python differently. 
While I've +not found a language that is as intrinsically beautiful as Lisp, I +think it is not quite right to compare Lisp and Python. Python is a +very readable language and it tends to be very accessible to +beginners. Also, it is hard to write unreadable Python code. + +\section{DRY} + +The core functionality of combox is to split files, encrypt the file shards, +spread them across node directories (Google Drive and Dropbox) and +decrypt, glue shards and put them back in the combox directory when a +file is created/modified/deleted/moved on another computer. The plan +was to use external libraries to accomplish things that fell outside +the realm of what I consider the ``core functionality of combox''; the +main reason behind this decision was to duly be an indolent programmer +and not indulge in trying to solve problems that others have already +solved. + +The \verb+watchdog+\cite{pylib:watchdog} library was chosen for file +monitoring; this library is compatible with Unix systems and +Windows. The \verb+pycrypto+ library\cite{pylib:pycrypto} was used for +encrypting data; combox uses the AES encryption scheme to encrypt file +shards. The \verb+pickleDB+\cite{pylib:pickledb} library was used to +store information about files in the combox directory; this library is +not very clean, but it was exactly what I was looking for; if there was +no \verb+pickleDB+, I would've most probably written something similar +to it and made it part of combox. + +Looking back, the decision to use external libraries reduced the +complexity of combox, reduced the time to complete the initial working +version of combox and made it possible to spend more than 3 months +just testing and fixing issues in combox. + +\section{Operating system compatibility}\label{4-os-compat} + +combox was developed on a GNU/Linux machine, but a conscious effort was +made to write it in an operating system independent way. 
The top criterion +for choosing a library to use in combox was that it had to be +compatible with \emph{all} of the three major computing platforms in +2014-2016\footnote{GNU/Linux, OS X, and Windows}. + +As we were nearing the \verb+0.1.0+ release, combox was tested on OS X +(See chapter \ref{ch:5}) and the OS X specific issues that were found +were eventually fixed. The initial \verb+0.1.0+ release was +compatible with GNU/Linux and OS X. + +After the initial release of combox, we wanted to see if combox would +be compatible with Windows. We found that: + +\begin{itemize} +\item Setting up the paraphernalia to run combox was + non-trivial\cite{doc:combox-setup-windoze}. +\item The unit tests for the \verb+combox.file+ module royally failed. +\end{itemize} + +At the time of writing the report, combox is in version \verb+0.2.2+ +and it is still not compatible with Windows. Comprehensive documentation +of setting up the development environment for combox on Windows was +written\cite{doc:combox-setup-windoze} to make it less cumbersome for +anyone who would want to work on making combox compatible with +Windows. + +\section{combox as a python package}\label{4-pypi} + +Before version \verb+0.2.0+, the canonical way to install combox was +to pull the source from the \verb+git+ repository with: + +\begin{verbatim} + git clone git://ricketyspace.net/combox.git +\end{verbatim} + +Then, do: + +\begin{verbatim} + cd combox +\end{verbatim} + +Finally install combox with: + +\begin{verbatim} + python setup.py install +\end{verbatim} + +Yes, installing combox on a machine was indeed non-trivial. 
+ +Python has a package registry called CheeseShop\footnote{code name for + Python Package Index, see https://wiki.python.org/moin/CheeseShop}; +all packages registered at the CheeseShop can be installed using +\verb+pip+ -- Python's platform independent package management +system\cite{py:pip} -- with: + +\begin{verbatim} + pip install packagename +\end{verbatim} + +To make it easier for (Python) users to install combox on their +machine, an effort was made to make it a Python +package\cite{py:package-guide}. From version \verb+0.2.0+, combox has +been a registered Python package at the CheeseShop. (Python) users can +now easily get a copy of combox on their machine with: + +\begin{verbatim} + pip install combox +\end{verbatim} + +All versions of combox that are available through the CheeseShop are +digitally signed using the following GPG key: + +\begin{verbatim} +pub 4096R/00B252AF 2014-09-08 [expires: 2017-09-07] + Key fingerprint = C174 1162 CEED 5FE8 9954 A4B9 9DF9 7838 00B2 52AF +uid Siddharth Ravikumar (sravik) <sravik@bgsu.edu> +sub 4096R/09CECEDB 2014-09-08 [expires: 2017-09-07] +\end{verbatim} + +All versions of combox's source are also available as a compressed +\verb+TAR+ ball and as a \verb+ZIP+ archive; they can be downloaded +from \url{https://ricketyspace.net/combox/releases.html}. + +\section{With the benefit of hindsight}\label{4-hindsight} + +combox's node monitor (\verb+combox.events.NodeDirMonitor+) was +written with the assumption that the node monitor would be the only +entity making changes to the node directory that it is +monitoring.
When we started testing combox with node clients (Dropbox +client and Google Drive client), we observed that the node clients +made changes to the node directory when a file was +created/modified/renamed/deleted. For instance, when a shard in the +Dropbox node directory was modified on a remote computer, the Dropbox +client would first pull the newer version of the shard under the +\verb+.dropbox.cache+ directory as a temporary file, move the older +version of the shard under \verb+.dropbox.cache+ as a backup, and +finally move the latest version of the shard, stored as a temporary +file under the \verb+.dropbox.cache+ directory, to the respective +location in the Dropbox node directory. When a shard in the Google +Drive node directory was modified on a remote computer, the +Google Drive client would delete the older version of the shard from +the Google Drive node directory and then create the newer version of +the shard in the respective location under the Google Drive node +directory. Since combox did not know about the node clients' +behaviour, this confused combox and broke it royally; we had to make +major changes to the \verb+combox.events.NodeDirMonitor+ class to +make combox aware of the node clients' behaviour, and this eventually +brutally obliterated the simplicity of the +\verb+combox.events.NodeDirMonitor+ class which I was proud of. + +I'm not sure how I would have written the \verb+combox.events+ module +if I had known about the Dropbox and Google Drive clients' behaviour +before writing the \verb+combox.events.NodeDirMonitor+ or the +\verb+combox.events.ComboxDirMonitor+ classes. Looking back, if there is one +thing I would want to re-think/redo, it is the \verb+combox.events+ +module. + +The most important lesson I'm taking away from the experience of +writing combox is the insight of how easy it is to ruthlessly crush the +simplicity of a program due to unforeseen use cases.
+ +\verb+<3+ diff --git a/report/chapters/3-lit-r.tex b/report/chapters/3-lit-r.tex @@ -1,234 +0,0 @@ -\chapter{Background and Literature Review} - -\epigraph{Books serve to show a man that those original thoughts of - his aren't very new after all}{\textit{Abraham Lincoln}} - -The idea of unifying the storage provided by multiple Internet file -storage providers and storing all the content in an encrypted form is -not new; computer researchers/scientists and programmers have devised -different methods to use multiple file storage providers' storage -space. This chapter gives an overview of the work done by Yeo et -al. in unifying the storage provided by Dropbox, Box, Google Drive and -SkyDrive on Android devices\cite{yeo} (Section \ref{3-yeo-sec}); -SkyCDS, a content delivery service, by Gonzalez et al., which uses -the publish/subscribe overlay paradigm and stores the content across -multiple ``cloud'' storage providers such that only part of the -content (in encrypted form) is stored on each ``cloud'' storage -provider\cite{skycds} (Section \ref{3-skycds-sec}); lastly, -\verb+git-annex+, by Joey Hess\cite{person:joeyh}, that allows one to -version control and keep track of large files with a possibility of -encrypting files that are stored in ``special remotes'' -- storage -provided by Internet file storage providers (Section -\ref{3-gitannex-sec}). - -\section{Multi Cloud Storage Prototype}\label{3-yeo-sec} - -In their paper ``Leveraging client-side storage techniques for -enhanced use of multiple consumer cloud storage services on -resource-constrained mobile devices'', Yeo et al. show their Android -mobile application, a prototype, which unifies storage provided by -Dropbox, Box, Google Drive and SkyDrive.
The application allows the -user to store all their information in a single location on their -phone and the application uses erasure coding\cite{weatherspoon} to -split each file into \verb`n + k` fragments and spreads the encrypted -fragments across storage provided by the file storage providers. All -basic file operations -- Create, Rename, Update, Delete (CRUD) -- are -possible. Information about the file stored in a unified location is -stored in a SQLite database. Unlike combox, which depends on the file -storage provider's client to sync file fragments/shards to the file -storage provider's server, the Android application developed by Yeo et -al. takes the responsibility to sync file fragments/shards to each -file storage provider and uses the OAuth 2.0\cite{protocal:oauth2} -protocol for authorization. - -For encrypting file fragments, they use AES-256; the key for -encrypting is derived from the user's password by using Password-Based -Key Derivation Function (PBKDF2)\cite{kaliski}. For erasure coding -they use the JigDFS library\cite{jigdfs}. The Android application is -able to do ``progressive streaming'' of media files; this means that -large media files can be streamed in real-time from the file -storage providers' servers; this is an attractive feature in a -``resource constrained'' device where storage is expensive. - -Yeo et al. propose methods for achieving data de-duplication, file -fragment/shard compression based on the type of the file, intelligent -pre-fetching and caching for file fragments and ``automatic -restoration in exploiting file-versioning''; these features were not -implemented in the prototype Android application and there is a -possibility of Yeo et al. implementing these features in the future. - -It becomes clear that the Yeo et al.
work is of immense importance when we -take into consideration the research done by Yang et al., which found -that 59\% of the users who use ``cloud storage service'' access the -service through a smart phone and 42.2\% of users access -audio/video\cite{yang}. The research by Yang et al. definitely -suggests a trend of users' preference for small hand-held computers -over laptops and desktops. - -\section{SkyCDS}\label{3-skycds-sec} - -SkyCDS, by Gonzalez et al., is a content delivery system that splits -and spreads the content across multiple ``cloud'' storage -providers\cite{skycds}. According to Gonzalez et al., the main reason -for designing and developing SkyCDS was to prevent content providers -from getting locked into just one ``cloud'' storage provider and to -minimize loss when a ``cloud'' storage provider goes out of business -or if there is a temporary outage in the storage service provided by the -``cloud'' storage provider. - -In SkyCDS the content delivery to subscribers of the content is -segregated into two distinct layers -- Metadata Flow Layer and the -Content Flow Layer. The publisher of the content largely interacts -with the Metadata Flow Layer that controls and keeps track of what -content is published and the subscriber also largely interacts with -the Metadata Flow Layer to subscribe to content published in the -content delivery system. The Content Flow Layer is where the content -is stored across multiple ``cloud'' storage providers. The publisher -is responsible for publishing the content using the ``delivery -workflow'' (part of the Content Flow Layer) and the subscriber uses -the ``retrieve workflow'' to get access to the subscribed content. - -When content has to be dispersed to $k$ ``cloud'' storage providers, -the content is split into $n$ chunks, $n > k$; this file splitting -seems to produce 66.7\% of redundancy overhead\cite{skycds}; this file -splitting scheme looks very similar to erasure coding, but Gonzalez et -al.
don't explicitly state that the content splitting scheme is indeed -``erasure coding''. The splitting of content is done by the ``delivery -workflow'' engine which is invoked when the publisher triggers the -action to publish the respective content to subscribers. - -To evaluate the effectiveness of SkyCDS, Gonzalez et al. state that -they've done a case study using the data (content) obtained from -European Space Astronomy Center (ESAC) for the Soil Moisture and Ocean -Salinity (SMOS) mission. In this study, a group of organizations, in two different -continents, used SkyCDS to share satellite images with each -other. According to Gonzalez et al. this study attested SkyCDS as a -viable option for content delivery with respect to performance, -cost of ``cloud'' storage space and reliability. - -\section{git-annex}\label{3-gitannex-sec} - -\verb+git-annex+ allows one to version control large files that are -not usually feasible to version control under -\verb+git+\cite{program:git}. \verb+git-annex+ checks in the names -and other meta-data about the files in git and stores the actual -content under the \verb+.git/annex+ directory. When a file is added to -\verb+git-annex+, a symlink to the file is created in place of the file -and the content of the file itself is stored under the -\verb+.git/annex+ directory. - -For instance, say a file called -\verb+deb-nicholson-80s.medium.webm+ was downloaded from the Internet -to the \verb+git-annex+ directory: - -\begin{verbatim} -↳ git status -On branch master -Untracked files: - (use "git add <file>..." to include in what will be committed) - - deb-nicholson-80s.medium.webm - -↳ ls -l -total 105708 -... --rw-r--r-- 1 rsd rsd 108196923 May 5 2015 deb-nicholson-80s.medium.webm -...
-\end{verbatim} - -When this file is added to \verb+git-annex+ with \verb+git annex add+, -the file turns into a symlink to a file under the \verb+.git/annex+ -directory: - -{\small -\begin{verbatim} -↳ git annex add deb-nicholson-80s.medium.webm -add deb-nicholson-80s.medium.webm ok -(recording state in git...) - -↳ ls -l -... -lrwxrwxrwx 1 rsd rsd 207 May 5 2015 deb-nicholson-80s.medium.webm -> ../.git/an -nex/objects/3j/vG/SHA256E-s108196923--7de9484ee96908268e21b451eb9805552c32b44da08e7 -0ee861332c87352944f.webm/SHA256E-s108196923--7de9484ee96908268e21b451eb9805552c32b4 -4da08e70ee861332c87352944f.webm - -↳ git commit -m "Added video/deb-nicholson-80s.medium.webm" -[master efa1775] Added video/deb-nicholson-80s.medium.webm - 1 file changed, 1 insertion(+) - create mode 120000 video/deb-nicholson-80s.medium.webm -\end{verbatim} -} - -Now, the file \verb+deb-nicholson-80s.medium.webm+ is checked into -\verb+git-annex+ and we can now do a \verb+git annex sync+ to sync the -repository to other \verb+git-annex+ repositories. It must be noted -here that when the repository is synced, the file content itself -is not transferred to the other \verb+git-annex+ repositories; only -the file's name and its meta-data that is stored in a separate git -branch called \verb+git-annex+ are -transferred\cite{documentation:git-annex-hworks}. In order to create a -copy of a given file in another git annex repository, -\verb+git annex get /path/to/filename.ext+ has to be done. - -\verb+git-annex+ has a feature called ``special -remotes''\cite{documentation:git-annex-sremotes}, that allows one to -push/copy data checked into \verb+git-annex+ to storage provided by -``cloud'' storage providers.
At the time of writing this report, -\verb+git-annex+ supports pushing data to the following file storage -services: - -{\scriptsize -\begin{itemize} -\item Amazon S3 -\item Amazon Glacier -\item Internet Archive via S3 -\item Box.com -\item Google Drive -\item Google Cloud Storage -\item Mega.co.nz -\item SkyDrive -\item OwnCloud -\item Flickr -\item IMAP -\item Usenet -\item chef-vault -\item hubiC -\item pCloud -\item ipfs -\item Ceph -\item Backblaze's B2 -\end{itemize} -} - -All data pushed to file storage provider's servers can be optionally -encrypted using one's GPG key. For instance, to encrypt data that is -pushed to the Amazon S3 special remote, the following command is -used\cite{docs:git-annex-as3}: - -\begin{verbatim} -$ git annex initremote cloud type=S3 keyid=2512E3C7 -initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok -$ git annex describe cloud "at Amazon's US datacenter" -describe cloud ok -\end{verbatim} - -where \verb+2512E3C7+ is the id of the GPG key to use for encrypting -data pushed to the Amazon S3 special remote. It is also possible to -store each file that is pushed to the remotes as a set of chunks of -size \verb+N+; to do that we do: - -\begin{verbatim} -$ git annex initremote cloud type=S3 chunk=1MiB keyid=2512E3C7 -initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok -$ git annex describe cloud "at Amazon's US datacenter" -describe cloud ok -\end{verbatim} - -With that, each file that has to be pushed to the Amazon S3 special -remote is divided into 1MiB chunks, each chunk is encrypted using the -GPG key \verb+2512E3C7+ and the encrypted chunks are finally pushed to -the Amazon S3 remote. It must be noted here that unlike the Multi -Cloud Storage Prototype or SkyCDS or combox, in \verb+git-annex+ when -we are using file chunking all the chunks go to the same location -- -in this case, the Amazon S3 remote.
diff --git a/report/chapters/4-arch-d.tex b/report/chapters/4-arch-d.tex @@ -1,504 +0,0 @@ -\chapter{Architecture and Design} - -\epigraph{In general, when modeling phenomena in science and - engineering, we begin with simplified, incomplete models. As we - examine things in greater detail, these simple models become - inadequate and must be replaced by more refined - models.}{\textit{Structure and Interpretation of Computer Programs, - Section 1.1.5}\cite{sicp}} - -\section{Structure of combox} - -combox consists of two main components -- the combox directory and the -node directories. The combox directory is the place where the user -stores all her files; the node directories are the directories across -which encrypted shards of the files (in the combox directory) are -scattered. A node directory is the file storage provider's -directory; for instance, the Dropbox directory and the Google Drive -directory are node directories. - -When a file \verb+file.ext+ is created in the combox directory, combox -splits \verb+file.ext+ into \verb+N+ shards, where \verb+N+ is the -number of node directories; if there are two node directories (Dropbox -directory and Google Drive directory), then 2 shards are created. Each -shard of the file is then encrypted and the encrypted shards are -spread evenly across the node directories; if there are two node -directories -- Dropbox directory and Google Drive directory -- combox -will create two encrypted shards of file \verb+file.ext+ -- -\verb+file.ext.shard0+, \verb+file.ext.shard1+ -- and place one -encrypted shard under the Dropbox directory and the other encrypted -shard under the Google Drive directory. Now, the Dropbox client and -the Google Drive client will sync the respective shards that were placed under -their directories to their servers.
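The splitting step described above can be illustrated with a minimal sketch. It assumes a file is cut into \verb+N+ contiguous, near-equal byte ranges and leaves encryption out; the text does not say how combox actually divides a file, and the names \verb+split_bytes+, \verb+shard_names+ and \verb+glue+ are hypothetical, not combox's API.

```python
def split_bytes(data, n):
    """Split data into n contiguous parts whose sizes differ by at most one byte.

    Hypothetical helper for illustration; not taken from combox.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    base, extra = divmod(len(data), n)
    shards = []
    start = 0
    for i in range(n):
        # The first `extra` shards carry one additional byte each.
        size = base + (1 if i < extra else 0)
        shards.append(data[start:start + size])
        start += size
    return shards


def shard_names(filename, n):
    """Name shards as in the text: file.ext.shard0, file.ext.shard1, ..."""
    return ["%s.shard%d" % (filename, i) for i in range(n)]


def glue(shards):
    """Rebuild the original content by concatenating shards in order."""
    return b"".join(shards)
```

With two node directories, \verb+split_bytes(content, 2)+ gives one part per directory, and \verb+glue+ applied to the parts in shard order returns the original content.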
- -\begin{figure}[h] -\includegraphics[scale=0.6]{4-combox-structure} -\caption{High level view of combox on two computers.} -\label{fig:4-combox-structure} -\end{figure} - -Now, we can move to another computer and start combox on it. First, -the node clients (Dropbox client and the Google Drive client) will -sync the new encrypted shards to their respective directories. Once -the encrypted shards are synced to the node directories, combox will -pick the encrypted shards -- \verb+file.ext.shard0+, -\verb+file.ext.shard1+ -- decrypt them, reconstruct them into -\verb+file.ext+ and place it in the respective location under the combox -directory; figure \ref{fig:4-combox-structure} illustrates this. The -process is similar for file modification, deletion and rename/move. - -\subsection{combox configuration}\label{sec:4-combox-config} - -combox configuration triggers automatically when combox finds that it -is not configured on this computer. The combox configuration sets up -the combox directory; asks the user to point to the location of the -node directories; reads the key (passphrase) to be used to encrypt -file shards that are spread across the node directories. The combox -configuration is written to -\verb+$HOME/.combox/config.yaml+; this YAML configuration file can be -manually edited by the user. - -The \verb+config_cb+ function in the \verb+combox.config+ module is -responsible for carrying out the combox configuration. Prior to -version \verb+0.2.0+, the combox configuration was purely done through -the CLI; from \verb+0.2.0+ onwards, by default, the combox -configuration is done through a graphical interface; it is still possible -to configure combox through the CLI with the \verb+--cli+ switch.
- -A demo of combox configuration using the graphical interface on -GNU/Linux can be viewed at -\url{https://ricketyspace.net/combox/combox-config-gui-glued-gnu.webm}; -the same demo of combox configuration using the graphical interface on -OS X can be viewed at -\url{https://ricketyspace.net/combox/combox-config-gui-glued-osx.webm}. - -\subsection{combox directory monitor}\label{sec:4-combox-cdirm} - -The combox directory monitor is an instance of -\verb+combox.events.ComboxDirMonitor+ monitoring the combox directory -for changes. When changes are made to the combox directory, the combox -directory monitor is responsible for correctly detecting the type of -change and doing the right thing at that instant. - -When a file is created in the combox directory, the combox directory -monitor will take that file, split it into \verb+N+ (equal to the -number of node directories) shards, encrypt the shards, spread the -encrypted shards to the node directories, and finally store the hash -of the file in the local combox database. - -When a file is modified in the combox directory, the combox directory -monitor will take that modified file, split it into \verb+N+ (equal to -the number of node directories) shards, encrypt the shards, spread the -encrypted shards to the node directories, and finally update the hash -of the file in the local combox database. - -When a file is deleted in the combox directory, the combox directory -monitor will remove the encrypted shards of the file in the node -directories and get rid of the file's hash from the local combox -database. - -When a file is moved/renamed in the combox directory, the combox -directory monitor will move/rename the encrypted shards in the node directories, -remove the file's hash stored under its old name from the local combox database and store the hash of -the file under its new name. - -\subsection{Node directory monitor}\label{sec:4-combox-nodirm} - -The node directory monitor is an instance of -\verb+combox.events.NodeDirMonitor+ monitoring a node directory.
When -changes are made to the node directory, the node directory monitor is -responsible for correctly detecting the type of change and doing the -right thing at that instant. Each node directory has a -dedicated node directory monitor; if there are 2 node directories, -then combox will instantiate 2 node directory monitors. - -When an encrypted shard is created in the node directory due to a file -created on another computer, the node directory monitor first checks if the -respective file's encrypted shard(s) has/have arrived in the other node -directory/directories. If all encrypted shards have arrived, then the -node directory monitor takes all the encrypted shards, decrypts them, -reconstructs the file and creates the file in the combox directory of -this computer and stores the hash of the newly created file in the -local combox database. If not all the encrypted shards have -arrived, then the node directory monitor does not do anything. It must be -observed here that the node directory monitor of the last node -directory which gets the encrypted shard will be the one to perform -the file reconstruction and creation. - -When an encrypted shard is modified in the node directory due to a -file modified on another computer, the node directory monitor first checks if -the respective file's modified encrypted shard(s) has/have arrived in -the other node directory/directories. If all modified encrypted shards -have arrived, then the node directory monitor takes all the modified encrypted -shards, decrypts them, reconstructs the file and puts the modified -version of the file in the combox directory of this computer and -updates the file's hash in the local combox database. If not all the -modified encrypted shards have arrived, then the node directory monitor -does not do anything.
It must be observed here that the node directory -monitor of the last node directory which gets the modified encrypted -shard will be the one to perform the file reconstruction and will place -the modified file in the combox directory. - -When an encrypted shard is deleted in the node directory due to a file -deleted on another computer, the node directory monitor first checks if the -respective file's encrypted shard(s) has/have been deleted in the other -node directory/directories. If all encrypted shards have been deleted -from the node directories, then the node directory monitor deletes the file in -the combox directory of this computer and removes its information from -the local combox database. If not all encrypted shards have been -deleted, then the node directory monitor does not do anything. It must be -observed here that the node directory monitor of the last node -directory in which the encrypted shard is deleted will be the one to -delete the file from the combox directory. - -When an encrypted shard is moved/renamed in the node directory due to -a file moved/renamed on another computer, the node directory monitor first -checks if the respective file's moved/renamed encrypted shard(s) -has/have arrived in the other node directory/directories. If all -moved/renamed encrypted shards have arrived, then the node directory monitor -takes all the moved/renamed encrypted shards, decrypts them, -reconstructs the moved/renamed file and puts the moved/renamed file -in the combox directory of this computer and stores the hash under the -file's new name in the local combox database. If not all the -moved/renamed encrypted shards have arrived, then the node -directory monitor does not do anything. It must be observed here that the node -directory monitor of the last node directory which gets the -moved/renamed encrypted shard will be the one to perform the file -reconstruction and will place the moved/renamed file in the combox -directory.
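The ``last monitor to see a shard acts'' rule that runs through the four cases above can be sketched as a per-file counter. This is only an illustration under the assumption that each node directory reports exactly one shard event per file change; the class \verb+ShardCounter+ and its method names are hypothetical and are not combox's implementation.

```python
class ShardCounter(object):
    """Counts shard events per file across node directories.

    Hypothetical illustration; one counter would be kept per event type
    (created, modified, deleted, moved).
    """

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.seen = {}  # file path -> number of shard events seen so far

    def shard_event(self, file_path):
        """Record one shard event; return True only for the last shard."""
        count = self.seen.get(file_path, 0) + 1
        if count < self.num_nodes:
            # Other node directories have not caught up yet: do nothing.
            self.seen[file_path] = count
            return False
        # Every node directory has reported: the caller should act now
        # (reconstruct, delete or move the file) and the count is reset.
        self.seen.pop(file_path, None)
        return True
```

With two node directories, the first call for a path returns \verb+False+ and the second returns \verb+True+, so only the monitor that observes the final shard touches the combox directory.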
- -\subsection{Database structure}\label{sec:4-combox-db} - -To keep it simple, stupid, I decided to maintain bare minimum -information about files stored in the combox directory, and depend on -file system events to do the right thing when changes take place in -the combox directory. - -The only information that is stored in the database about a file in -the combox directory is its SHA-512 hash; the SHA-512 hash of a file -is enough information to detect changes in the file. In the database, there -are also four dictionaries -- \verb+file_moved+, \verb+file_deleted+, -\verb+file_created+, \verb+file_modified+ -- which track the number -of shards of a file that were moved/deleted/created/modified due to the -respective file being moved/deleted/created/modified on another -computer; these four dictionaries are primarily used by the -\verb+NodeDirMonitor+ to detect remote file -movement/deletion/creation/modification and to trigger file -reconstruction from shards at the right time. - -The database is a JSON file on the disk, stored by default at -\verb+$HOME/.combox/silo.db+. The -\verb+combox.silo.ComboxSilo+\cite{combox-src:silo.ComboxSilo} class is the -sole interface to read from and write to the database. The database is -primarily accessed and modified by the combox directory monitor -(\verb+ComboxDirMonitor+) and the node directory monitor -(\verb+NodeDirMonitor+) through a shared Lock\cite{py:threading.Lock} -that ensures that only one entity\footnote{An entity can be the combox - directory monitor or one of the node directory monitors} can -access/modify the database at a time.
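A minimal sketch of the idea, assuming an in-memory dictionary in place of the JSON file: a path maps to the SHA-512 hash of the file's content, and every access goes through one \verb+threading.Lock+. The class \verb+TinySilo+ and its methods are hypothetical names for illustration and do not mirror the \verb+ComboxSilo+ interface.

```python
import hashlib
import threading


class TinySilo(object):
    """Path -> SHA-512 hex digest store guarded by a single lock.

    Hypothetical illustration; combox itself persists to a JSON file.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._db = {}

    def update(self, path, content):
        """Store (or overwrite) the hash of `content` under `path`."""
        digest = hashlib.sha512(content).hexdigest()
        with self._lock:
            self._db[path] = digest

    def stale(self, path, content):
        """Return True if `content` differs from what is recorded for `path`.

        An unknown path counts as stale.
        """
        digest = hashlib.sha512(content).hexdigest()
        with self._lock:
            return self._db.get(path) != digest

    def remove(self, path):
        """Forget `path`; removing an unknown path is a no-op."""
        with self._lock:
            self._db.pop(path, None)
```

Comparing the stored hash with a fresh one is what lets a monitor tell a genuine content change from an event that left the file untouched.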
- -Below is an illustration of the structure of the combox database: - -\begin{verbatim} -{ - "/home/rsd/combox/ipsum.txt": "e3206df2bb2b3091103ab9d...", - "/home/rsd/combox/tk-shot-osx.png": "7fcf1b44c15dd95e0...", - "/home/rsd/combox/thgttg-21st.png": "0040eedfc3eeab546...", - "/home/rsd/combox/lorem.txt": "5851dd7a4870ff165facb71...", - "/home/rsd/combox/the-red-star.jpg": "4b818126d882e552...", - "file_moved": {}, - "file_deleted": {}, - "file_created": {}, - "file_modified": {} -} -\end{verbatim} - -The \verb+combox.silo.ComboxSilo+ class, which is the sole interface to read -from and write to the database, uses the pickleDB -library\cite{pylib:pickledb}. pickleDB is a very basic key-value -store which allows one to store information in the JSON format; if I -had not found this library, or if this library had never been written by -Harrison Erd, I would have written something very similar to this -library as part of combox to realize the basic key-value storage that -is needed to track the hashes of the files stored in the combox -directory. - -It must be noted that the combox database stored on each computer is -independent and does not communicate or make transactions with the -combox databases located on other computers. - -\section{combox modules overview} - -combox is spread into modules that have functions and/or classes.
As -of \verb+2016-02-04+ combox is a considerably small program: - -\begin{verbatim} -$ wc -l combox/*.py - 144 combox/cbox.py - 178 combox/config.py - 241 combox/crypto.py - 891 combox/events.py - 541 combox/file.py - 454 combox/gui.py - 0 combox/__init__.py - 71 combox/log.py - 278 combox/silo.py - 29 combox/_version.py - 2827 total -\end{verbatim} - -This section gives an overview of each of the combox modules with -extreme brevity: - -\begin{description} -\item[combox.cbox] This module contains the \verb+run_cb+ function that runs - combox; it creates an instance of \verb+threading.Lock+ for database - access and a shared \verb+threading.Lock+ for the - \verb+combox.events.ComboxDirMonitor+ and - \verb+combox.events.NodeDirMonitor+; it initializes an instance of - \verb+combox.events.ComboxDirMonitor+ that monitors the combox - directory and an instance of \verb+combox.events.NodeDirMonitor+ for - each node directory for monitoring the node directories. This - module also houses the \verb+main+ function that parses commandline - arguments, starts combox configuration if needed or loads the combox - configuration file to start running combox. -\item[combox.config] Accommodates two important functions -- - \verb+config_cb+ and \verb+get_nodedirs+. The \verb+config_cb+ function is - the combox configuration function that allows the user to configure - combox; this function was designed in such a way that it can be - used for both the CLI and GUI methods of configuring - combox. The \verb+get_nodedirs+ function returns, as a list, the - paths of the node directories; this function is used in numerous - places in other combox modules.
-\item[combox.crypto] This has functions for encrypting and decrypting - data; encrypting and decrypting shards (\verb+encrypt_shards+ and - \verb+decrypt_shards+); a function for splitting a file into shards, - encrypting those shards and spreading them across node directories - (\verb+split_and_encrypt+); a function for decrypting the shards - from the node directories, reconstructing the file from the - decrypted shards and putting the file back to the combox directory - (\verb+decrypt_and_glue+). Functions \verb+split_and_encrypt+ and - \verb+decrypt_and_glue+ are the two functions that are - extensively used by the \verb+combox.events+ module; all other - functions in this module are pretty much helper functions for the - \verb+split_and_encrypt+ and \verb+decrypt_and_glue+ functions and - are not used by other modules. -\item[combox.events] This module took the most time to write and test - and it is the most complex module in combox at the time of writing - this report. It contains just two classes -- \verb+ComboxDirMonitor+ - and \verb+NodeDirMonitor+. The \verb+ComboxDirMonitor+ inherits - \verb+watchdog.events.LoggingEventHandler+ and is responsible for - monitoring for changes in the combox directory and doing the right - thing when a change happens in the combox directory. The - \verb+NodeDirMonitor+ also inherits - \verb+watchdog.events.LoggingEventHandler+ and is similarly responsible - for monitoring a node directory and doing the right thing when a - change happens in the node directory; subjectively, - \verb+NodeDirMonitor+ is slightly more complex than the - \verb+ComboxDirMonitor+. -\item[combox.file] This is the second largest module in combox. It - contains utility functions for reading, writing, moving - files/directories, hashing files, splitting a file into shards, gluing - shards into a file, manipulating directories inside combox and node - directories.
-\item[combox.gui] Contains the \verb+ComboxConfigDialog+ class; it is - the graphical interface for configuring combox. The class uses the - Tkinter library\cite{pylib:tkinter} for spawning graphical - elements. Other graphical libraries, including PyQt\cite{pylib:qt}, - were considered; Tkinter was chosen over others because it works on - all Unix systems and Microsoft's Windows and it is part of the core - python (version 3). -\item[combox.log] All the messages to \verb+stdout+ and \verb+stderr+ - are sent through the \verb+log_i+ and \verb+log_e+ - functions defined in this module. -\item[combox.silo] Contains the \verb+ComboxSilo+ class which is the - canonical interface for combox for managing information about the - files in the combox directory. Internally, the \verb+ComboxSilo+ - class uses the pickleDB library\cite{pylib:pickledb}. -\item[combox.\_version] This is a \emph{private} module that contains - variables that hold the value of the present version and release - of combox. The \verb+get_version+ function in this module returns - the full version number; this function is used by \verb+setup.py+. -\end{description} - -\section{Language choice} - -Back in October of 2014, I was learning to write in Python and when I -had to start working on combox, I chose to write combox in Python. In -my first commit to the combox repository, I had this to say about -Python: - -\begin{verbatim} -commit 2def977472b2e77ee88c9177f2d03f12b0263eb0 -Author: rsiddharth <rsiddharth@ninthfloor.org> -Date: Wed Oct 29 23:24:58 2014 -0400 - - Initial commit: File splitter & File gluer done. - - ... - - I like to write python FWIW. But after reading a dialect of Lisp when - I come back to python, it does not look very beautiful. I guess I'm - pretty convinced that there is no language that can ape the beauty of - Lisp. -\end{verbatim} - -If I were to write that commit message today (\verb+2016-02-04+), I -would've phrased my reflections about Python differently.
While I've -not found a language that is as intrinsically beautiful as Lisp, I -think it is not quite right to compare Lisp and Python. Python is a -very readable language and it tends to be very accessible to -beginners. Also, it is hard to write unreadable Python code. - -\section{DRY} - -The core functionality of combox is to split, encrypt file shards, -spread them across node directories (Google Drive and Dropbox) and -decrypt, glue shards and put them back in the combox directory when a -file is created/modified/deleted/moved on another computer. The plan -was to use external libraries to accomplish things that fell outside -the realm of what I consider the ``core functionality of combox''; the -main reason behind this decision was to duly be an indolent programmer -and not indulge in trying to solve problems that others have already -solved. - -The \verb+watchdog+\cite{pylib:watchdog} library was chosen for file -monitoring; this library is compatible with Unix systems and -Windows. The \verb+pycrypto+ library\cite{pylib:pycrypto} was used for -encrypting data; combox uses the AES encryption scheme to encrypt file -shards. The \verb+pickleDB+\cite{pylib:pickledb} library was used to -store information about files in the combox directory; this library is -not very clean, but it was exactly what I was looking for; if there were -no \verb+pickleDB+, I would've most probably written something similar -to it and made it part of combox. - -Looking back, the decision to use external libraries reduced the -complexity of combox, reduced the time to complete the initial working -version of combox and made it possible to spend more than 3 months -just testing and fixing issues in combox. - -\section{Operating system compatibility}\label{4-os-compat} - -combox was developed on a GNU/Linux machine, but a conscious effort was -made to write it in an operating system independent way. 
The top criterion -for choosing a library to use in combox was that it had to be -compatible with \emph{all} of the three major computing platforms in -2014-2016\footnote{GNU/Linux, OS X, and Windows}. - -As we were nearing the \verb+0.1.0+ release, combox was tested on OS X -(see chapter \ref{ch:5}) and the OS X specific issues that were found -were eventually fixed. The initial \verb+0.1.0+ release was -compatible with GNU/Linux and OS X. - -After the initial release of combox, we wanted to see if combox would -be compatible with Windows. We found that: - -\begin{itemize} -\item Setting up the paraphernalia to run combox was - non-trivial\cite{doc:combox-setup-windoze}. -\item The unit tests for the \verb+combox.file+ module royally failed. -\end{itemize} - -At the time of writing the report, combox is in version \verb+0.2.2+ -and it is still not compatible with Windows. Comprehensive documentation -of setting up the development environment for combox on Windows was -written\cite{doc:combox-setup-windoze} to make it less cumbersome for -anyone who would want to work on making combox compatible with -Windows. - -\section{combox as a python package}\label{4-pypi} - -Before version \verb+0.2.0+, the canonical way to install combox was -to pull the source from the \verb+git+ repository with: - -\begin{verbatim} - git clone git://ricketyspace.net/combox.git -\end{verbatim} - -Then, do: - -\begin{verbatim} - cd combox -\end{verbatim} - -Finally, install combox with: - -\begin{verbatim} - python setup.py install -\end{verbatim} - -Yes, installing combox on a machine was indeed non-trivial. 
- -Python has a package registry called CheeseShop\footnote{code name for - Python Package Index, see https://wiki.python.org/moin/CheeseShop}; -all packages registered at the CheeseShop can be installed using -\verb+pip+ -- Python's platform independent package management -system\cite{py:pip} -- with: - -\begin{verbatim} - pip install packagename -\end{verbatim} - -To make it easier for (python) users to install combox on their -machine, an effort was made to make it a python -package\cite{py:package-guide}. From version \verb+0.2.0+, combox has -been a registered python package at the CheeseShop. (Python) users can -now easily get a copy of combox on their machine with: - -\begin{verbatim} - pip install combox -\end{verbatim} - -All versions of combox that are available through the CheeseShop are -digitally signed using the following GPG key: - -\begin{verbatim} -pub 4096R/00B252AF 2014-09-08 [expires: 2017-09-07] - Key fingerprint = C174 1162 CEED 5FE8 9954 A4B9 9DF9 7838 00B2 52AF -uid Siddharth Ravikumar (sravik) <sravik@bgsu.edu> -sub 4096R/09CECEDB 2014-09-08 [expires: 2017-09-07] -\end{verbatim} - -All versions of combox's source are also available as a compressed -\verb+TAR+ ball and as a \verb+ZIP+ archive; they can be downloaded -from \url{https://ricketyspace.net/combox/releases.html}. - -\section{With the benefit of hindsight}\label{4-hindsight} - -combox's node monitor (\verb+combox.events.NodeDirMonitor+) was -written with the assumption that the node monitor will be the only -entity that will be making changes to the node directory that it is -monitoring. 
When we started testing combox with node clients (Dropbox -client and Google Drive client), we observed that the node clients -made changes to the node directory when a file was -created/modified/renamed/deleted; for instance, when a shard, in the -Dropbox node directory, was modified on a remote computer, the Dropbox -client would first pull the newer version of the shard under the -\verb+.dropbox.cache+ directory as a temporary file, move the older -version of the shard under \verb+.dropbox.cache+ as a backup, and -finally move the latest version of the shard, stored as a temporary -file under the \verb+.dropbox.cache+ directory, to the respective -location in the Dropbox node directory; when a shard, in the Google -Drive node directory, was modified on a remote computer, the -Google Drive client would delete the older version of the shard from -the Google Drive node directory and then create the newer version of -the shard in the respective location under the Google Drive node -directory. Since combox did not know about the node clients' -behaviour, this confused combox and broke it royally; we had to make -major changes to the \verb+combox.events.NodeDirMonitor+ class to -make combox aware of the node clients' behaviour, which eventually -brutally obliterated the simplicity of the -\verb+combox.events.NodeDirMonitor+ class which I was proud of. - -I'm not sure how I would have written the \verb+combox.events+ module -if I had known about the Dropbox and Google Drive clients' behaviour -before writing the \verb+combox.events.NodeDirMonitor+ or the -\verb+combox.events.ComboxDirMonitor+ classes. Looking back, if there is one -thing I would want to re-think/redo, it is the \verb+combox.events+ -module. - -The most important lesson I'm taking away from the experience of -writing combox is the insight into how easy it is to ruthlessly crush the -simplicity of a program due to unforeseen use cases. 
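One way to picture the node client behaviour described in this section is as a path filter that a node monitor could consult before reacting to an event. The following is only a minimal sketch under that assumption; it is not the change that was actually made to \verb+NodeDirMonitor+. The helper name \verb+is_node_client_scratch+ and the tuple \verb+NODE_CLIENT_CACHE_DIRS+ are hypothetical; only the \verb+.dropbox.cache+ directory name comes from the text above.

```python
import os

# Hypothetical list of scratch directories kept by node clients.
# Only `.dropbox.cache` is named in the report; anything else would
# have to be added after observing the respective client.
NODE_CLIENT_CACHE_DIRS = (".dropbox.cache",)


def is_node_client_scratch(path, node_dir):
    """Return True if `path` lies inside a node client's scratch
    directory somewhere under the node directory `node_dir`."""
    # Work with the path relative to the node directory so that the
    # check does not depend on where the node directory is mounted.
    rel = os.path.relpath(os.path.abspath(path), os.path.abspath(node_dir))
    parts = rel.split(os.sep)
    return any(part in NODE_CLIENT_CACHE_DIRS for part in parts)
```

A monitor callback could return early whenever this helper says the event path is client scratch space, and only process events for shards at their final location.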
- -\verb+<3+ diff --git a/report/chapters/4-testing.tex b/report/chapters/4-testing.tex @@ -0,0 +1,668 @@ +\chapter{Testing}\label{ch:5} + +\epigraph{Testing shows the presence, not the absence of + bugs.}{\textit{Dijkstra}\cite{dijkstra69}} + +\section{Unit testing}\label{sec:5-unit-testing} + +The \verb+nose+\cite{pylib:nose} testing framework was used to +write unit tests for the functions and classes that are part of the +\verb+combox.config+, \verb+combox.crypto+, \verb+combox.events+, +\verb+combox.file+, \verb+combox.silo+ and \verb+combox._version+ +modules. Unit tests were not written for the \verb+combox.cbox+, +\verb+combox.gui+ and \verb+combox.log+ modules. + +Unit tests for combox became a reality by pure serendipity. At the +time I started working on combox, I was learning to use the +\verb+nose+ library to unit test python code. Since \verb+combox+ was +being written in python, I started making it a norm to write unit +tests for functions and classes in combox modules. + +As mentioned before, unit tests were not written for some modules +either because it would make no sense to write one (for the +\verb+combox.cbox+ module, for instance, which basically uses +functions and classes defined in other modules to run combox) or it +was not clear how to write unit tests for it (the \verb+combox.gui+ +module contains just the \verb+ComboxConfigDialog+, a graphical front-end +which uses the configuration function defined in the +\verb+combox.config+ module to complete the combox configuration based +on the user input). + +It must be noted here that pure Test Driven Development (TDD) was not +observed -- most of the time the function/class was written before +its corresponding test was written. + +\subsection{Benefits} + +While writing unit tests definitely increased the time to write a +particular feature, it enabled me to immediately check if a feature +worked as it should for the given use case or given set of inputs. 
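The immediate feedback described above can be illustrated with a small round-trip test. This is a sketch, not code from the combox test suite: \verb+split_data+ and \verb+glue_shards+ are hypothetical stand-ins for combox's split and glue functions, and the standard library's \verb+unittest+ is used here in place of \verb+nose+ so that the example has no external dependency.

```python
import unittest


def split_data(data, n):
    """Hypothetical splitter: cut `data` (bytes) into `n` contiguous
    shards; the last shard takes whatever remainder is left."""
    size = len(data) // n
    shards = [data[i * size:(i + 1) * size] for i in range(n - 1)]
    shards.append(data[(n - 1) * size:])
    return shards


def glue_shards(shards):
    """Hypothetical gluer: concatenate the shards back in order."""
    return b"".join(shards)


class TestSplitGlue(unittest.TestCase):
    def test_round_trip(self):
        # Splitting and then gluing must give back the original data.
        data = b"It must be beautiful there"
        self.assertEqual(glue_shards(split_data(data, 3)), data)

    def test_shard_count(self):
        # The splitter must always produce exactly `n` shards.
        self.assertEqual(len(split_data(b"abcdefg", 3)), 3)
```

Saved in a module, the tests can be run with \verb+python -m unittest+; a failing round trip shows up as soon as the splitter or the gluer is changed.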
+ +With the benefit of hindsight, unit tests greatly helped in testing +the compatibility of combox on OS X. Before the \verb+v0.1.0+ release, +combox's node directory monitor always assumed that a file's first +shard (\verb+shard0+) is always available; while this assumption did +not create any problems on GNU/Linux, on OS X, this assumption made +the node directory monitor behave erratically -- this issue (bug +\#4\cite{combox-issue-tracker}) was immediately found when the unit +tests were run for the first time on OS X. Another instance where unit +tests helped was just before the \verb+v0.2.0+ release; major changes, +including the introduction of file locks in the +\verb+ComboxDirMonitor+, were made to the \verb+combox.events+ module. When +the unit tests were run on OS X, two tests failed, revealing a difference +in behavior of watchdog\cite{pylib:watchdog} on GNU/Linux and OS X on +file creation\cite{combox-wd-fix}; without unit tests, there is a high +probability that this bug would not have been found by now. + +\subsection{Caveats} + +Unit tests are helpful in testing the correctness of a feature for +\verb+N+ number of use cases but that does not necessarily mean the +written feature behaves correctly for use cases that the author of the +feature did not consider or did not think about while writing the +respective feature. As Dijkstra correctly observed, testing shows the +presence, not the absence, of bugs. + +Unit tests failed to reveal bugs \#4, \#5, \#6, \#7, \#5, \#10, +\#11\cite{combox-issue-tracker}; these bugs were found when manually +testing combox. + +\section{Manual testing}\label{sec:5-manual-testing} + +The unit tests for the \verb+combox.events+ module test the +correctness of the \verb+ComboxDirMonitor+ and \verb+NodeDirMonitor+ +independently; in order to comprehensively test the correctness of +both \verb+ComboxDirMonitor+ and \verb+NodeDirMonitor+, it was +required to manually test combox running on more than one computer. 
As +you'll see in the following subsections, several bugs were found and +fixed while doing manual testing. + +Three different types of setups were used to test combox. The first +kind of setup has two GNU/Linux machines each using combox to sync +files between each other with Dropbox and Google Drive being the +nodes; the second kind of setup has a GNU/Linux machine and an OS X +machine each using combox to sync files between each other with +Dropbox and Google Drive being the nodes; the third kind of setup has +a GNU/Linux machine and an OS X machine each using combox to sync files +between each other with Dropbox, Google Drive and a USB stick as +nodes. + +\subsection{General setup and notes} + +\begin{itemize} +\item On the GNU/Linux machines, the official Dropbox client was used + to sync the Dropbox node directory to Dropbox's + servers. \verb+rclone+\cite{program:rclone} was used to sync the + Google Drive node directory to Google Drive's servers; at the time of + testing, Google Drive did not have a client for GNU/Linux. +\item On OS X, the official Dropbox client was used to sync the + Dropbox node directory to Dropbox's servers; the official Google + Drive client was used to sync the Google Drive node directory to + Google Drive's servers. +\item Since combox is extremely event-driven, combox must be started + before the Dropbox and Google Drive clients start syncing their + respective directories (nodes). +\end{itemize} + +\subsection{Testing on two GNU/Linux machines} + +combox was run on two GNU/Linux machines and a file was alternately +created/modified/renamed/deleted on one of the GNU/Linux machines and it +was verified that the respective file was also +created/modified/renamed/deleted on the other GNU/Linux machine. One +of the GNU/Linux machines (\verb+lyra+) was a virtual machine running +Debian GNU/Linux stable (version 8.x); the other GNU/Linux machine +(\verb+grus+) was a physical machine running Debian GNU/Linux +testing. 
The node directories to scatter the files' shards were the +Dropbox directory and Google Drive directory. The official Dropbox +client was used to automatically sync files from the Dropbox directory +to Dropbox's server; \verb+rclone+\cite{program:rclone} was used to +sync files from the Google Drive directory to Google Drive's server. + +\subsubsection{Issues found}\label{ch-5-2gnus-issues} + +\begin{itemize} +\item Some editors, especially on POSIX compliant systems, create a + backup version of the file being edited. combox was detecting this + backup file as a ``new file'' and it split it into shards, encrypted + the shards and scattered the shards across the node directories. The + right thing for combox to do was to ignore these backup files and do + nothing about them. This issue was fixed on + \verb+2015-09-29+\cite{combox-issue-tracker}. Now the + \verb+ComboxDirMonitor+, on a ``file created'' or ``file modified'' + event, returns from the \verb+on_created+ or \verb+on_modified+ + callback when it finds that the file is a backup/temporary file. +\item The Dropbox client maintains the \verb+.dropbox.cache+ directory + under the root of the Dropbox directory. + + \begin{itemize} + \item When a file (shard) was created on another computer, the + Dropbox client pulls the new file (shard) to this computer into + \verb+.dropbox.cache+ as a temporary file and then moves the new + file (shard) to its respective location with the appropriate name. + \item When a file (shard) was modified on another computer, the + Dropbox client pulls the modified file (shard) to this computer + into the \verb+.dropbox.cache+ as a temporary file; moves the old + version of the file (shard) under the Dropbox directory into the + \verb+.dropbox.cache+; finally moves the updated copy of the file, + stored as a temporary file, into the Dropbox directory to its + respective location with the appropriate name. 
+ \item When a file (shard) was deleted on another computer, the + Dropbox client moves the deleted file into the + \verb+.dropbox.cache+ directory on this computer. + \end{itemize} + + All of the above behavior of the Dropbox client epically broke + combox. Commits \verb+3d714c5+ to + \verb+6e1133f+\cite{git:dropbox-fix} fixed combox by making it aware + of the Dropbox client's behavior. +\end{itemize} + +\subsubsection{Demo} + +Demo of combox being used on two GNU/Linux machines can be viewed at +\url{https://ricketyspace.net/combox/combox-2-gnus.webm}. + +\verb+lyra+ (virtual machine) and \verb+grus+ (bare-metal) are the two +GNU/Linux machines being used for the demo. + +Description of what happens in the demo follows: + + - (lyra) install combox. + + - (lyra) run combox (test mode). + + - (lyra) create file \verb+walden.pond+ with content ``It must be + beautiful there''. + + - (lyra) sync Google Drive using \verb+rclone+. + + - (grus) sync Google Drive using \verb+rclone+. + + - (grus) git pull latest copy of combox. + + - (grus) install combox. + + - (grus) run combox (testing mode). + + - (grus) verify that \verb+walden.pond+ was created on this machine. + + - (grus) append 'Peaceful too.' to \verb+walden.pond+. + + - (grus) sync Google Drive using \verb+rclone+. + + - (lyra) sync Google Drive using \verb+rclone+. + + - (lyra) verify that the latest copy of \verb+walden.pond+ is there + in the combox directory; it should contain 'Peaceful too.' in the + last line. + + - (lyra) append ``I've a dream'' to \verb+walden.pond+. + + - (lyra) sync Google Drive using \verb+rclone+. + + - (grus) sync Google Drive using \verb+rclone+. + + - (grus) verify that the latest copy of \verb+walden.pond+ is there + in the combox directory; it should contain ``I've a dream'' in the + last line. + + - (grus) remove \verb+walden.pond+ from combox directory. + + - (grus) sync Google Drive using \verb+rclone+. + + - (lyra) sync Google Drive using \verb+rclone+. 
+ + - (lyra) verify that \verb+walden.pond+ is removed from the combox + directory. + + - (grus) open Dropbox and Google Drive accounts from the web browser. + + - (lyra) create file \verb+manufacturing.consent+ with content ``Chomsky stuff?''. + + - (lyra) sync Google Drive using \verb+rclone+. + + - (grus) sync Google Drive using \verb+rclone+. + + - (grus) verify that \verb+manufacturing.consent+ was created in the + combox directory. + + - (grus) verify that the shards of \verb+manufacturing.consent+ were + created on Dropbox and Google Drive through the web browser. + +\subsection{Testing on a GNU/Linux and an OS X machine} + +combox was run on a GNU/Linux machine and an OS X machine and a file +was alternately created/modified/renamed/deleted on one of the +machines and it was verified that the respective file was also +created/modified/renamed/deleted on the other machine. The GNU/Linux +machine was a virtual machine (\verb+lyra+) running Debian GNU/Linux +stable; the OS X machine was on Mavericks (10.9) during the initial +stage of testing, later it was upgraded to Yosemite (10.10). The node +directories to scatter files' shards were the Dropbox directory and +the Google Drive directory. The official Dropbox client was used to +automatically sync files from the Dropbox directory to Dropbox's +server on both the GNU/Linux machine and the OS X machine; the +official Google Drive client was used to automatically sync files from +the Google Drive directory to Google Drive's server on OS X and +\verb+rclone+\cite{program:rclone} was used to sync files from the +Google Drive directory to Google Drive's server on GNU/Linux. + +\subsubsection{Issues found} + +\begin{itemize} +\item When a file was modified on another computer, on this computer + combox assumed that the first shard (shard0) will be updated first and + also counted on the existence of the first shard (shard0). 
It was + observed that the order in which the shards were updated was + unpredictable on this computer and if the first shard (shard0) was + stored in the Dropbox directory, it would momentarily disappear + before the most updated shard became available in the Dropbox + directory; this broke combox. This issue was fixed on + 2015-08-25\cite{git:bug-four-fix}. This issue has nothing to do with + the nature of the setup; it is related to the Dropbox client's behavior + elaborated in section \ref{ch-5-2gnus-issues}. +\item The official Google Drive client, when it pulls an updated + version of the file from Google Drive's server, instead of directly + updating the respective file on the computer, deletes the older + version of the file and creates the latest version of the file at + the respective location in the Google Drive directory; this behavior + of the Google Drive client confused and broke combox. This issue was fixed + on 2015-09-06 by making combox aware of the official Google Drive client's + behavior\cite{git:bug-googledc-fix}. +\item When a non-empty directory was moved/renamed on another computer, + the old directory was not getting properly deleted on this computer; + this was happening because the files under the directory being + renamed were not deleted when it was time for \verb+NodeDirMonitor+ + to \verb+rmdir+ the old directory. This issue again is not specific + to the nature of the setup but was found while testing combox on + this setup. This issue was fixed on + 2015-09-12\cite{git:bug-six-fix}. +\item It was found that the \verb+combox.file.rm_path+ function failed + when it was given a non-existent path to remove; this issue was + fixed on 2015-09-12\cite{git:bug-seven-fix}. 
+\end{itemize} + +\subsubsection{Demo} + +Demo of combox being used on a GNU/Linux machine and an OS X machine can +be viewed at \url{https://ricketyspace.net/combox/combox-gnu-osx.webm}. + +\verb+lyra+ is the GNU/Linux (virtual) machine and +\verb+dhcp-129-1-66-1+ is the OS X machine that is being used for the +demo. The OS X machine is accessed through VNC\cite{article:vnc}. + +Description of what happens in the demo follows: + + - (\verb+lyra+) create file \verb+cat.stevens+ with content ``peace train''. + + - (\verb+lyra+) sync Google Drive using \verb+rclone+. + + - (\verb+dhcp-129-1-66-1+) verify that file \verb+cat.stevens+ is + created with content ``peace train''. + + - (\verb+dhcp-129-1-66-1+) append string ``moonshadow'' to file + \verb+cat.stevens+. + + - (\verb+lyra+) sync Google Drive using \verb+rclone+. + + - (\verb+lyra+) verify that the file \verb+cat.stevens+ was updated + (modified); last line must have the string ``moonshadow''. + + - (\verb+lyra+) append string ``father and son'' to the file + \verb+cat.stevens+. + + - (\verb+lyra+) sync Google Drive using \verb+rclone+. + + - (\verb+dhcp-129-1-66-1+) verify that the file \verb+cat.stevens+ + was updated (modified); last line must have the string ``father and + son''. + + - (\verb+dhcp-129-1-66-1+) rename file \verb+cat.stevens+ to + \verb+yusuf.islam+. + + - (\verb+lyra+) sync Google Drive using \verb+rclone+. + + - (\verb+lyra+) verify that the file \verb+cat.stevens+ was renamed + to \verb+yusuf.islam+. + +\subsection{Testing with a USB stick as a node} + +combox was run on a GNU/Linux machine and an OS X machine and a file +was alternately created/modified/deleted on one of the machines and +it was verified that the respective file was also +created/modified/deleted on the other machine. The GNU/Linux machine +was a physical machine (\verb+grus+) running Debian GNU/Linux stable; +the OS X machine was on Mavericks (10.9). 
The node directories to +scatter files' shards were the Dropbox directory, Google Drive +directory and the USB stick (\verb+ZAPHOD+, FAT filesystem). The +official Dropbox client was used to automatically sync files from +Dropbox directory to Dropbox's server on both the GNU/Linux machine and +OS X machine; the official Google Drive client was used to +automatically sync files from the Google Drive directory to Google +Drive's server on OS X and \verb+rclone+\cite{program:rclone} was used +to sync files from the Google Drive directory to Google Drive's server +on GNU/Linux; the same USB stick (\verb+ZAPHOD+) was used on both +GNU/Linux and OS X to store the third shard (shard2) of a file. + +\subsubsection{Caveats} + +\begin{itemize} +\item When a removable USB disk is used as a node, combox must be + turned off before ejecting/unmounting the USB disk; combox does not + expect a node directory to disappear when it is running; if the USB + disk is removed when combox is running, then combox goes to an + undefined state. + +\item When a file modified on machine A is synced to machine B, combox + must be turned on first before turning on Dropbox and Google Drive + clients and the shard in the USB disk needs to be ``touched'' for + combox to detect that the file was modified on the remote computer + and update the file locally on this machine. + +\item File rename/move does not work. To make it work, core + functionality of combox must be re-written. +\end{itemize} + +\subsubsection{Demo} + +Demo of combox being used with a USB stick as the third node can be +viewed at \url{https://ricketyspace.net/combox/combox-usb-node-demo.webm}. + +\verb+grus+ is the GNU/Linux machine and \verb+dhcp-129-1-66-1+ is the +OS X machine that is being used for the demo. \verb+ZAPHOD+ is the +FAT32 USB stick used as the third node. + +Description of what happens in the demo follows: + + - (\verb+grus+) start combox. 
+ + - (\verb+grus+) create a file called \verb+simon.and.garfunkel+ with + content ``the boxer''. + + - (\verb+grus+) sync Google Drive using \verb+rclone+. + + - (\verb+grus+) stop combox. + + - (\verb+grus+) unmount USB stick (\verb+ZAPHOD+) from \verb+grus+. + + - (\verb+dhcp-129-1-66-1+) mount USB stick (\verb+ZAPHOD+) to + (\verb+dhcp-129-1-66-1+). + + - (\verb+dhcp-129-1-66-1+) start Dropbox client. + + - (\verb+dhcp-129-1-66-1+) start Google Drive client. + + - (\verb+dhcp-129-1-66-1+) start combox. + + - (\verb+dhcp-129-1-66-1+) verify that the file + \verb+simon.and.garfunkel+ with content ``the boxer'' was created. + + - (\verb+dhcp-129-1-66-1+) append string ``mrs. robinson'' to file + \verb+simon.and.garfunkel+. + + - (\verb+dhcp-129-1-66-1+) stop combox. + + - (\verb+dhcp-129-1-66-1+) stop Google Drive client. + + - (\verb+dhcp-129-1-66-1+) stop Dropbox client. + + - (\verb+dhcp-129-1-66-1+) unmount the USB stick (\verb+ZAPHOD+) + from (\verb+dhcp-129-1-66-1+). + + - (\verb+grus+) mount the USB stick (\verb+ZAPHOD+) to + (\verb+grus+). + + - (\verb+grus+) start combox. + + - (\verb+grus+) start Dropbox client. + + - (\verb+grus+) sync Google Drive using \verb+rclone+. + + - (\verb+grus+) touch \verb+simon.and.garfunkel.shard2+ in the USB + stick (\verb+ZAPHOD+). + + - (\verb+grus+) verify that the file \verb+simon.and.garfunkel+ is + updated; the last line must contain the string ``mrs. robinson''. + + - (\verb+grus+) remove the file \verb+simon.and.garfunkel+. + + - (\verb+grus+) sync Google Drive using \verb+rclone+. + + - (\verb+grus+) unmount the USB stick (\verb+ZAPHOD+) from + (\verb+grus+). + + - (\verb+grus+) stop Dropbox client. + + - (\verb+dhcp-129-1-66-1+) mount the USB stick (\verb+ZAPHOD+) to + (\verb+dhcp-129-1-66-1+). + + - (\verb+dhcp-129-1-66-1+) start Google Drive client. + + - (\verb+dhcp-129-1-66-1+) start Dropbox client. + + - (\verb+dhcp-129-1-66-1+) start combox. 
+ + - (\verb+dhcp-129-1-66-1+) verify that the file + \verb+simon.and.garfunkel+ was deleted. + + +\section{Stress testing} + +A large number of files of different sizes were dumped to the combox +directory at one second intervals to see how combox responds to +high load. The file dump size was varied from \verb+424.798190MiB+ (27 +files) to \verb+10800.000000MiB+ (180 files); the average time taken +to split a file and the total time to process all files were +calculated for each dump. + +Stress testing was first done on \verb+2015-11-08+. In mid November +the \verb+ComboxDirMonitor+ was drastically modified to make it use +the file lock shared with the instances of +\verb+NodeDirMonitor+\cite{git:bug-eleven-fix}; my hunch was that this +change in \verb+ComboxDirMonitor+ directly affected the performance of +combox and therefore the results obtained from stress testing on +\verb+2015-11-08+ would no longer be valid. Stress testing was again +done on \verb+2016-01-16+; the results of this stress test are in +sections \ref{5-st-424} to \ref{5-st-10800}, section \ref{5-st-tu} +gives information about the tools used for stress testing, section +\ref{5-st-o} contains the observations and comparisons between this +stress test and the one done on \verb+2015-11-08+, lastly section +\ref{5-st-if} reveals the issues that were found with combox by virtue +of doing the stress tests. + +\subsection{flac dump (27 files - 424.798190MiB)}\label{5-st-424} + +\begin{center} +\begin{tabular}{ll} +field & value\\ +\hline +delay between a file dump & 1s\\ +start time of processing & 11:00:54\\ +end time of processing & 11:01:38\\ +total time taken to process all files & 00:00:44\\ +no. of files & 27\\ +total size of all files & 445433187.000000 bytes (424.798190MiB)\\ +avg. file size & 16497525.000000 bytes (15.733266MiB)\\ +avg. 
time to split and encrypt a file & 352.583370 ms\\ +\end{tabular} +\end{center} + +\subsubsection{Differences from previous stress test (2015-11-08)} + +\begin{itemize} +\item Total time to process all files was faster by 1min3secs. +\item Average time to split and encrypt a file reduced by + 28.337963000000002ms. +\end{itemize} + +\subsection{20MiB - 90MiB dump (27 files - 1620.000000MiB)}\label{5-st-1620} + +\begin{center} +\begin{tabular}{ll} +field & value\\ +\hline +delay between a file dump & 1s\\ +start time of processing & 12:26:45\\ +end time of processing & 12:29:07\\ +total time taken to process all files & 00:02:22\\ +no. of files & 27\\ +total size of all files & 1698693120.000000 bytes (1620.000000MiB)\\ +avg. file size & 62914560.000000 bytes (60.000000MiB)\\ +avg. time to split and encrypt a file & 2670.596556ms\\ +\end{tabular} +\end{center} + +\subsubsection{Differences from previous stress test (2015-11-08)} + +\begin{itemize} +\item Total time to process all files was slower by 4secs. +\item Average time to split and encrypt a file reduced by + 25.52536999999984ms. +\end{itemize} + +\subsection{20MiB - 90MiB dump (99 files - 5940.000000MiB)}\label{5-st-5940} + +\begin{center} +\begin{tabular}{ll} +field & value\\ +\hline +delay between a file dump & 1s\\ +start time of processing & 13:10:16\\ +end time of processing & 13:19:26\\ +total time taken to process all files & 00:09:10\\ +no. of files & 99\\ +total size of all files & 6228541440.000000 bytes (5940.000000MiB)\\ +avg. file size & 62914560.000000 bytes (60.000000MiB)\\ +avg. time to split and encrypt a file & 2979.647586ms\\ +\end{tabular} +\end{center} + +\subsubsection{Differences from previous stress test (2015-11-08)} + +\begin{itemize} +\item Total time to process all files was faster by 59secs. +\item Average time to split and encrypt a file increased by + 206.20906100000002ms. 
+\end{itemize} + +\subsection{20MiB - 90MiB dump (180 files - 10800.000000MiB)}\label{5-st-10800} + +\begin{center} +\begin{tabular}{ll} +field & value\\ +\hline +delay between a file dump & 1s\\ +start time of processing & 13:42:06\\ +end time of processing & 14:00:10\\ +total time taken to process all files & 00:18:04\\ +no. of files & 180\\ +total size of all files & 11324620800.000000 bytes (10800.000000MiB)\\ +avg. file size & 62914560.000000 bytes (60.000000MiB)\\ +avg. time to split and encrypt a file & 3423.087539ms\\ +\end{tabular} +\end{center} + +\subsubsection{Differences from previous stress test (2015-11-08)} + +\begin{itemize} +\item Total time to process all files was slower by 1min2secs +\item Average time to split and encrypt a file increased by + 399.87623299999996ms. +\end{itemize} + +\subsection{Tools used}\label{5-st-tu} + +The \verb+dump+ script\cite{program:dump} was used to dump files to +the combox directory between one second intervals; a night of Emacs +Lisp indulgence made it possible to quickly slurp the required data +from the combox output and calculate the average time to split and +encrypt a file and the total amount of time taken to process the files +for a given dump\cite{program:dumps.el}; lastly \verb+org-mode+ was +used to document all data gathered during stress +testing\cite{doc:benchmarks.org}. + +\subsection{Observations}\label{5-st-o} + +\begin{figure}[h] +\centering +\input{graphs/tot-time.tex} +\caption{time to process all files} +\label{fig:5-st-tt} +\end{figure} + +\begin{figure}[h] +\centering +\input{graphs/avg-time-sae.tex} +\caption{avg. time to split and encrypt} +\label{fig:5-st-atsae} +\end{figure} + + +\begin{itemize} +\item Figure \ref{fig:5-st-tt} shows the time it takes combox to + process files for a given file dump\footnote{A ``file dump'' here + means a bunch of files copied to the combox directory between 1 + sec intervals.}. 
As can be observed from the graph, the total time + taken to process all the files tends to increase almost linearly with + the increase in the size of the file dump\footnote{The ``size of the + file dump'' is the total size of all files in a given file dump.}. +\item Figure \ref{fig:5-st-atsae} shows the average time it takes + combox to split and encrypt a file for a given file dump. There is a + steep increase in the average time between the \verb+424.798190MiB+ + dump and the \verb+1620.000000MiB+ dump, after which the average + time to split and encrypt a file seems to increase almost linearly; + the main reason for this is that the average file size for dumps + from \verb+1620.000000MiB+ to \verb+10800.000000MiB+ is the same. +\end{itemize} + +\begin{figure}[h] +\centering +\input{graphs/tot-time-diff.tex} +\caption{time to process all files - difference between 2015 and 2016} +\label{fig:5-st-tt-diff} +\end{figure} + +\begin{figure}[h] +\centering +\input{graphs/avg-time-sae-diff.tex} +\caption{avg. time to split and encrypt - difference between 2015 and 2016} +\label{fig:5-st-atsae-diff} +\end{figure} + +\begin{itemize} +\item Figure \ref{fig:5-st-tt-diff} shows the graphs for the total + amount of time taken to process all files for a given file dump in + the \verb+2016-01-16+ and \verb+2015-11-08+ stress tests. The amount + of time needed to process all files seems to be reduced for the + \verb+5940.000000MiB+ file dump when compared to the \verb+2015+ + stress test results and it seems to be slightly higher for the + \verb+10800.000000MiB+ file dump when compared to the \verb+2015+ + stress test. +\item Similarly, figure \ref{fig:5-st-atsae-diff} shows the graphs for + the average time to split and encrypt for a given file dump in the + \verb+2016-01-16+ and the \verb+2015-11-08+ stress tests. 
The average + time taken seems to be almost the same for the + \verb+424.798190MiB+ and the \verb+1620.000000MiB+ dumps, but for the + \verb+5940.000000MiB+ and the \verb+10800.000000MiB+ dumps the + average time taken seems to be higher for the \verb+2016+ stress test + when compared to the \verb+2015+ stress test. +\end{itemize} + +\subsection{Issues found}\label{5-st-if} + +\begin{itemize} +\item Initially when combox was stress tested with huge files, combox + would get overwhelmed, leading to the computer running out of memory + and the load average sometimes peaking at \verb+8+. At first, it was + assumed that there was a bug in combox which caused this to happen, + but later it was found that \verb+watchdog+\cite{pylib:watchdog} was + generating a large number of ``file modified'' events when a huge file + (\verb+~500MiB+) was modified. To prevent \verb+watchdog+ from + generating a large number of ``file modified'' events for a single + modification of a huge file, a delay proportional to the size of the + file was introduced in the \verb+on_modified+ callback methods in both + \verb+ComboxDirMonitor+ and + \verb+NodeDirMonitor+\cite{git:bug-ten-fix}; this fixed the + issue. Also, it might be useful to note here that this was + ``the'' hardest issue I dealt with in working on combox. +\end{itemize}+ \ No newline at end of file diff --git a/report/chapters/5-testing.tex b/report/chapters/5-testing.tex @@ -1,668 +0,0 @@ -\chapter{Testing}\label{ch:5} - -\epigraph{Testing shows the presence, not the absence of - bugs.}{\textit{Dijkstra}\cite{dijkstra69}} - -\section{Unit testing}\label{sec:5-unit-testing} - -The \verb+nose+\cite{pylib:nose} testing framework was used to -write unit tests for the functions and classes that are part of the -\verb+combox.config+, \verb+combox.crypto+, \verb+combox.events+, -\verb+combox.file+, \verb+combox.silo+ and \verb+combox._version+ -modules. Unit tests were not written for the \verb+combox.cbox+, -\verb+combox.gui+ and \verb+combox.combox.log+ modules.
- -Unit tests for combox became a reality by pure serendipity. At the -time I started working on combox, I was learning to use the -\verb+nose+ library to unit test Python code. Since \verb+combox+ was -being written in Python, I started making it a norm to write unit -tests for functions and classes in combox modules. - -As mentioned before, unit tests were not written for some modules -either because it would make no sense to write one (for the -\verb+combox.cbox+ module, for instance, which basically uses -functions and classes defined in other modules to run combox) or because it -was not clear how to write unit tests for it (the \verb+combox.gui+ -module contains just the \verb+ComboxConfigDialog+, a graphical front-end -which uses the configuration function defined in the -\verb+combox.config+ module to complete the combox configuration based -on the user input). - -It must be noted here that pure Test Driven Development (TDD) was not -observed -- most of the time the function/class was written before -its corresponding test was written. - -\subsection{Benefits} - -While writing unit tests definitely increased the time to write a -particular feature, it enabled me to immediately check if a feature -worked as it should for the given use case or given set of inputs. - -With the benefit of hindsight, unit tests greatly helped in testing -the compatibility of combox on OS X. Before the \verb+v0.1.0+ release, -combox's node directory monitor always assumed that a file's first -shard (\verb+shard0+) is always available; while this assumption did -not create any problems on GNU/Linux, on OS X, this assumption made -the node directory monitor behave erratically -- this issue (bug -\#4\cite{combox-issue-tracker}) was immediately found when the unit -tests were run for the first time on OS X.
Another instance where unit -tests helped was just before the \verb+v0.2.0+ release; major changes, -including the introduction of file locks in the -\verb+ComboxDirMonitor+, were made to the \verb+combox.events+ module. When -the unit tests were run on OS X, two tests failed, revealing a difference -in behavior of watchdog\cite{pylib:watchdog} on GNU/Linux and OS X on -file creation\cite{combox-wd-fix}; without unit tests, there is a high -probability that this bug would not have been found by now. - -\subsection{Caveats} - -Unit tests are helpful in testing the correctness of a feature for -\verb+N+ number of use cases, but this does not necessarily mean that the -feature behaves correctly for use cases that the author of the -feature did not consider or did not think about while writing the -respective feature. As Dijkstra correctly observed, testing shows the -presence, not the absence, of bugs\cite{dijkstra69}. - -Unit tests failed to reveal bugs \#4, \#5, \#6, \#7, \#5, \#10 and -\#11\cite{combox-issue-tracker}; these bugs were found when manually -testing combox. - -\section{Manual testing}\label{sec:5-manual-testing} - -The unit tests for the \verb+combox.events+ module test the -correctness of the \verb+ComboxDirMonitor+ and \verb+NodeDirMonitor+ -independently; in order to comprehensively test the correctness of -both \verb+ComboxDirMonitor+ and \verb+NodeDirMonitor+, it was -required to manually test combox running on more than one computer. As -you'll see in the following subsections, several bugs were found and -fixed while doing manual testing. - -Three different types of setups were used to test combox.
The first -kind of setup has two GNU/Linux machines each using combox to sync -files between each other with Dropbox and Google Drive being the -nodes; the second kind of setup has a GNU/Linux machine and an OS X -machine each using combox to sync files between each other with -Dropbox and Google Drive being the nodes; the third kind of setup has -a GNU/Linux machine and an OS X machine each using combox to sync files -between each other with Dropbox, Google Drive and a USB stick as -nodes. - -\subsection{General setup and notes} - -\begin{itemize} -\item On the GNU/Linux machines, the official Dropbox client was used - to sync the Dropbox node directory to Dropbox's - servers. \verb+rclone+\cite{program:rclone} was used to sync the - Google Drive node directory to Google Drive's servers; at the time of - testing, Google Drive did not have a client for GNU/Linux. -\item On OS X, the official Dropbox client was used to sync the - Dropbox node directory to Dropbox's servers; the official Google - Drive client was used to sync the Google Drive node directory to - Google Drive's servers. -\item Since combox is extremely event-driven, combox must be started - before the Dropbox and Google Drive clients start syncing their - respective directories (nodes). -\end{itemize} - -\subsection{Testing on two GNU/Linux machines} - -combox was run on two GNU/Linux machines and a file was alternately -created/modified/renamed/deleted on one of the GNU/Linux machines and it -was verified if the respective file was also -created/modified/renamed/deleted on the other GNU/Linux machine. One -of the GNU/Linux machines (\verb+lyra+) was a virtual machine running -Debian GNU/Linux stable (version 8.x); the other GNU/Linux machine -(\verb+grus+) was a physical machine running Debian GNU/Linux -testing. The node directories to scatter the files' shards were the -Dropbox directory and the Google Drive directory.
The official Dropbox -client was used to automatically sync files from the Dropbox directory -to Dropbox's server; \verb+rclone+\cite{program:rclone} was used to -sync files from the Google Drive directory to Google Drive's server. - -\subsubsection{Issues found}\label{ch-5-2gnus-issues} - -\begin{itemize} -\item Some editors, especially on POSIX compliant systems, create a - backup version of the file being edited. combox was detecting this - backup file as a ``new file'' and it split it into shards, encrypted - the shards and scattered the shards across the node directories. The - right thing for combox to do was to ignore these backup files and do - nothing about them. This issue was fixed on - \verb+2015-09-29+\cite{combox-issue-tracker}. Now the - \verb+ComboxDirMonitor+, on a ``file created'' or ``file modified'' - event, returns from the \verb+on_created+ or \verb+on_modified+ - callback when it finds that the file is a backup/temporary file. -\item The Dropbox client maintains the \verb+.dropbox.cache+ directory - under the root of the Dropbox directory. - - \begin{itemize} - \item When a file (shard) is created on another computer, the - Dropbox client pulls the new file (shard) to this computer into - \verb+.dropbox.cache+ as a temporary file and then moves the new - file (shard) to its respective location with the appropriate name. - \item When a file (shard) is modified on another computer, the - Dropbox client pulls the modified file (shard) to this computer - into the \verb+.dropbox.cache+ as a temporary file; moves the old - version of the file (shard) under the Dropbox directory into the - \verb+.dropbox.cache+; finally moves the updated copy of the file, - stored as a temporary file, into the Dropbox directory to its - respective location with the appropriate name. - \item When a file (shard) is deleted on another computer, the - Dropbox client moves the deleted file into the - \verb+.dropbox.cache+ directory on this computer.
- \end{itemize} - - All of the above behavior of the Dropbox client epically broke - combox. Commits \verb+3d714c5+ to - \verb+6e1133f+\cite{git:dropbox-fix} fixed combox by making it aware - of the Dropbox client's behavior. -\end{itemize} - -\subsubsection{Demo} - -A demo of combox being used on two GNU/Linux machines can be viewed at -\url{https://ricketyspace.net/combox/combox-2-gnus.webm}. - -\verb+lyra+ (virtual machine) and \verb+grus+ (bare-metal) are the two -GNU/Linux machines being used for the demo. - -A description of what happens in the demo follows: - - - (lyra) install combox. - - - (lyra) run combox (test mode). - - - (lyra) create file \verb+walden.pond+ with content ``It must be - beautiful there''. - - - (lyra) sync Google Drive using \verb+rclone+. - - - (grus) sync Google Drive using \verb+rclone+. - - - (grus) git pull latest copy of combox. - - - (grus) install combox. - - - (grus) run combox (test mode). - - - (grus) verify that \verb+walden.pond+ was created on this machine. - - - (grus) append ``Peaceful too.'' to \verb+walden.pond+. - - - (grus) sync Google Drive using \verb+rclone+. - - - (lyra) sync Google Drive using \verb+rclone+. - - - (lyra) verify that the latest copy of \verb+walden.pond+ is there - in the combox directory; it should contain ``Peaceful too.'' in the - last line. - - - (lyra) append ``I've a dream'' to \verb+walden.pond+. - - - (lyra) sync Google Drive using \verb+rclone+. - - - (grus) sync Google Drive using \verb+rclone+. - - - (grus) verify that the latest copy of \verb+walden.pond+ is there - in the combox directory; it should contain ``I've a dream'' in the - last line. - - - (grus) remove \verb+walden.pond+ from combox directory. - - - (grus) sync Google Drive using \verb+rclone+. - - - (lyra) sync Google Drive using \verb+rclone+. - - - (lyra) verify that \verb+walden.pond+ is removed from the combox - directory. - - - (grus) open Dropbox and Google Drive accounts from the web browser.
- - - (lyra) create file \verb+manufacturing.consent+ with content ``Chomsky stuff?''. - - - (lyra) sync Google Drive using \verb+rclone+. - - - (grus) sync Google Drive using \verb+rclone+. - - - (grus) verify that \verb+manufacturing.consent+ was created in the - combox directory. - - - (grus) verify that the shards of \verb+manufacturing.consent+ were - created on Dropbox and Google Drive through the web browser. - -\subsection{Testing on a GNU/Linux and an OS X machine} - -combox was run on a GNU/Linux machine and an OS X machine and a file -was alternately created/modified/renamed/deleted on one of the -machines and it was verified if the respective file was also -created/modified/renamed/deleted on the other machine. The GNU/Linux -machine was a virtual machine (\verb+lyra+) running Debian GNU/Linux -stable; the OS X machine was on Mavericks (10.9) during the initial -stage of testing; later it was upgraded to Yosemite (10.10). The node -directories to scatter files' shards were the Dropbox directory and -the Google Drive directory. The official Dropbox client was used to -automatically sync files from the Dropbox directory to Dropbox's -server on both the GNU/Linux machine and the OS X machine; the -official Google Drive client was used to automatically sync files from -the Google Drive directory to Google Drive's server on OS X and -\verb+rclone+\cite{program:rclone} was used to sync files from the -Google Drive directory to Google Drive's server on GNU/Linux. - -\subsubsection{Issues found} - -\begin{itemize} -\item When a file was modified on another computer, on this computer - combox assumed that the first shard (shard0) would be updated first and - also counted on the existence of the first shard (shard0).
It was - observed that the order in which the shards were updated was - unpredictable on this computer and if the first shard (shard0) was - stored in the Dropbox directory, it would momentarily disappear - before the most updated shard became available in the Dropbox - directory; this broke combox. This issue was fixed on - 2015-08-25\cite{git:bug-four-fix}. This issue has nothing to do with - the nature of the setup; it is related to the Dropbox client's behavior - described in section \ref{ch-5-2gnus-issues}. -\item When the official Google Drive client pulls an updated - version of the file from Google Drive's server, instead of directly - updating the respective file on the computer, it deletes the older - version of the file and creates the latest version of the file at - the respective location in the Google Drive directory; this behavior - of the Google Drive client confused and broke combox. This issue was fixed - on 2015-09-06 by making combox aware of the official Google Drive client's - behavior\cite{git:bug-googledc-fix}. -\item When a non-empty directory was moved/renamed on another computer, - the old directory was not getting properly deleted on this computer; - this was happening because the files under the directory being - renamed were not deleted when it was time for \verb+NodeDirMonitor+ - to \verb+rmdir+ the old directory. This issue again is not specific - to the nature of the setup but was found while testing combox on - this setup. This issue was fixed on - 2015-09-12\cite{git:bug-six-fix}. -\item It was found that the \verb+combox.file.rm_path+ function failed - when it was given a non-existent path to remove; this issue was - fixed on 2015-09-12\cite{git:bug-seven-fix}.
-\end{itemize} - -\subsubsection{Demo} - -A demo of combox being used on a GNU/Linux machine and an OS X machine can -be viewed at \url{https://ricketyspace.net/combox/combox-gnu-osx.webm}. - -\verb+lyra+ is the GNU/Linux (virtual) machine and -\verb+dhcp-129-1-66-1+ is the OS X machine that is being used for the -demo. The OS X machine is accessed through VNC\cite{article:vnc}. - -A description of what happens in the demo follows: - - - (\verb+lyra+) create file \verb+cat.stevens+ with content ``peace train''. - - - (\verb+lyra+) sync Google Drive using \verb+rclone+. - - - (\verb+dhcp-129-1-66-1+) verify that file \verb+cat.stevens+ is - created with content ``peace train''. - - - (\verb+dhcp-129-1-66-1+) append string ``moonshadow'' to file - \verb+cat.stevens+. - - - (\verb+lyra+) sync Google Drive using \verb+rclone+. - - - (\verb+lyra+) verify that the file \verb+cat.stevens+ was updated - (modified); last line must have the string ``moonshadow''. - - - (\verb+lyra+) append string ``father and son'' to the file - \verb+cat.stevens+. - - - (\verb+lyra+) sync Google Drive using \verb+rclone+. - - - (\verb+dhcp-129-1-66-1+) verify that the file \verb+cat.stevens+ - was updated (modified); last line must have the string ``father and - son''. - - - (\verb+dhcp-129-1-66-1+) rename file \verb+cat.stevens+ to - \verb+yusuf.islam+. - - - (\verb+lyra+) sync Google Drive using \verb+rclone+. - - - (\verb+lyra+) verify that the file \verb+cat.stevens+ was renamed - to \verb+yusuf.islam+. - -\subsection{Testing with a USB stick as a node} - -combox was run on a GNU/Linux machine and an OS X machine and a file -was alternately created/modified/deleted on one of the machines and -it was verified if the respective file was also -created/modified/deleted on the other machine. The GNU/Linux machine -was a physical machine (\verb+grus+) running Debian GNU/Linux stable; -the OS X machine was on Mavericks (10.9).
The node directories to -scatter files' shards were the Dropbox directory, the Google Drive -directory and the USB stick (\verb+ZAPHOD+, FAT filesystem). The -official Dropbox client was used to automatically sync files from -the Dropbox directory to Dropbox's server on both the GNU/Linux machine and -the OS X machine; the official Google Drive client was used to -automatically sync files from the Google Drive directory to Google -Drive's server on OS X and \verb+rclone+\cite{program:rclone} was used -to sync files from the Google Drive directory to Google Drive's server -on GNU/Linux; the same USB stick (\verb+ZAPHOD+) was used on both the -GNU/Linux machine and the OS X machine to store the third shard (shard2) of a file. - -\subsubsection{Caveats} - -\begin{itemize} -\item When a removable USB disk is used as a node, combox must be - turned off before ejecting/unmounting the USB disk; combox does not - expect a node directory to disappear when it is running, so if the USB - disk is removed when combox is running, then combox goes into an - undefined state. - -\item When a file modified on machine A is synced to machine B, combox - must be turned on first before turning on the Dropbox and Google Drive - clients and the shard in the USB disk needs to be ``touched'' for - combox to detect that the file was modified on the remote computer - and update the file locally on this machine. - -\item File rename/move does not work. To make it work, core - functionality of combox must be re-written. -\end{itemize} - -\subsubsection{Demo} - -A demo of combox being used with a USB stick as the third node can be -viewed at \url{https://ricketyspace.net/combox/combox-usb-node-demo.webm}. - -\verb+grus+ is the GNU/Linux machine and \verb+dhcp-129-1-66-1+ is the -OS X machine that is being used for the demo. \verb+ZAPHOD+ is the -FAT32 USB stick used as the third node. - -A description of what happens in the demo follows: - - - (\verb+grus+) start combox.
- - - (\verb+grus+) create a file called \verb+simon.and.garfunkel+ with - content ``the boxer''. - - - (\verb+grus+) sync Google Drive using \verb+rclone+. - - - (\verb+grus+) stop combox. - - - (\verb+grus+) unmount USB stick (\verb+ZAPHOD+) from \verb+grus+. - - - (\verb+dhcp-129-1-66-1+) mount USB stick (\verb+ZAPHOD+) to - (\verb+dhcp-129-1-66-1+). - - - (\verb+dhcp-129-1-66-1+) start Dropbox client. - - - (\verb+dhcp-129-1-66-1+) start Google Drive client. - - - (\verb+dhcp-129-1-66-1+) start combox. - - - (\verb+dhcp-129-1-66-1+) verify that the file - \verb+simon.and.garfunkel+ with content ``the boxer'' was created. - - - (\verb+dhcp-129-1-66-1+) append string ``mrs. robinson'' to file - \verb+simon.and.garfunkel+. - - - (\verb+dhcp-129-1-66-1+) stop combox. - - - (\verb+dhcp-129-1-66-1+) stop Google Drive client. - - - (\verb+dhcp-129-1-66-1+) stop Dropbox client. - - - (\verb+dhcp-129-1-66-1+) unmount the USB stick (\verb+ZAPHOD+) - from (\verb+dhcp-129-1-66-1+). - - - (\verb+grus+) mount the USB stick (\verb+ZAPHOD+) to - (\verb+grus+). - - - (\verb+grus+) start combox. - - - (\verb+grus+) start Dropbox client. - - - (\verb+grus+) sync Google Drive using \verb+rclone+. - - - (\verb+grus+) touch \verb+simon.and.garfunkel.shard2+ in the USB - stick (\verb+ZAPHOD+). - - - (\verb+grus+) verify that the file \verb+simon.and.garfunkel+ is - updated; the last line must contain the string ``mrs. robinson''. - - - (\verb+grus+) remove the file \verb+simon.and.garfunkel+. - - - (\verb+grus+) sync Google Drive using \verb+rclone+. - - - (\verb+grus+) unmount the USB stick (\verb+ZAPHOD+) from - (\verb+grus+). - - - (\verb+grus+) stop Dropbox client. - - - (\verb+dhcp-129-1-66-1+) mount the USB stick (\verb+ZAPHOD+) to - (\verb+dhcp-129-1-66-1+). - - - (\verb+dhcp-129-1-66-1+) start Google Drive client. - - - (\verb+dhcp-129-1-66-1+) start Dropbox client. - - - (\verb+dhcp-129-1-66-1+) start combox. 
- - - (\verb+dhcp-129-1-66-1+) verify that the file - \verb+simon.and.garfunkel+ was deleted. - - -\section{Stress testing} - -A large number of files of different sizes were dumped to the combox -directory at one-second intervals to see how combox responds to -high load. The file dump size was varied from \verb+424.798190MiB+ (27 -files) to \verb+10800.000000MiB+ (180 files); the average time taken -to split and encrypt a file and the total time to process all files were -calculated for each dump. - -Stress testing was first done on \verb+2015-11-08+. In mid November -the \verb+ComboxDirMonitor+ was drastically modified to make it use -the file Lock shared with the instances of -\verb+NodeDirMonitor+\cite{git:bug-eleven-fix}; my hunch was that this -change in \verb+ComboxDirMonitor+ directly affected the performance of -combox and therefore the results that were obtained from stress testing on -\verb+2015-11-08+ would no longer be valid. Stress testing was again -done on \verb+2016-01-16+; the results of this stress test are in -sections \ref{5-st-424} to \ref{5-st-10800}; section \ref{5-st-tu} -gives information about the tools used for stress testing; section -\ref{5-st-o} contains the observations and comparisons between this -stress test and the one done on \verb+2015-11-08+; lastly, section -\ref{5-st-if} reveals the issues that were found with combox by virtue -of doing the stress tests. - -\subsection{flac dump (27 files - 424.798190MiB)}\label{5-st-424} - -\begin{center} -\begin{tabular}{ll} -field & value\\ -\hline -delay between a file dump & 1s\\ -start time of processing & 11:00:54\\ -end time of processing & 11:01:38\\ -total time taken to process all files & 00:00:44\\ -no. of files & 27\\ -total size of all files & 445433187.000000 bytes (424.798190MiB)\\ -avg. file size & 16497525.000000 bytes (15.733266MiB)\\ -avg.
time to split and encrypt a file & 352.583370 ms\\ -\end{tabular} -\end{center} - -\subsubsection{Differences from previous stress test (2015-11-08)} - -\begin{itemize} -\item Total time to process all files was faster by 1min3secs. -\item Average time to split and encrypt a file reduced by - 28.337963000000002ms. -\end{itemize} - -\subsection{20MiB - 90MiB dump (27 files - 1620.000000MiB)}\label{5-st-1620} - -\begin{center} -\begin{tabular}{ll} -field & value\\ -\hline -delay between a file dump & 1s\\ -start time of processing & 12:26:45\\ -end time of processing & 12:29:07\\ -total time taken to process all files & 00:02:22\\ -no. of files & 27\\ -total size of all files & 1698693120.000000 bytes (1620.000000MiB)\\ -avg. file size & 62914560.000000 bytes (60.000000MiB)\\ -avg. time to split and encrypt a file & 2670.596556ms\\ -\end{tabular} -\end{center} - -\subsubsection{Differences from previous stress test (2015-11-08)} - -\begin{itemize} -\item Total time to process all files was slower by 4secs. -\item Average time to split and encrypt a file reduced by - 25.52536999999984ms. -\end{itemize} - -\subsection{20MiB - 90MiB dump (99 files - 5940.000000MiB)}\label{5-st-5940} - -\begin{center} -\begin{tabular}{ll} -field & value\\ -\hline -delay between a file dump & 1s\\ -start time of processing & 13:10:16\\ -end time of processing & 13:19:26\\ -total time taken to process all files & 00:09:10\\ -no. of files & 99\\ -total size of all files & 6228541440.000000 bytes (5940.000000MiB)\\ -avg. file size & 62914560.000000 bytes (60.000000MiB)\\ -avg. time to split and encrypt a file & 2979.647586ms\\ -\end{tabular} -\end{center} - -\subsubsection{Differences from previous stress test (2015-11-08)} - -\begin{itemize} -\item Total time to process all files was faster by 59secs. -\item Average time to split and encrypt a file increased by - 206.20906100000002ms. 
-\end{itemize} - -\subsection{20MiB - 90MiB dump (180 files - 10800.000000MiB)}\label{5-st-10800} - -\begin{center} -\begin{tabular}{ll} -field & value\\ -\hline -delay between a file dump & 1s\\ -start time of processing & 13:42:06\\ -end time of processing & 14:00:10\\ -total time taken to process all files & 00:18:04\\ -no. of files & 180\\ -total size of all files & 11324620800.000000 bytes (10800.000000MiB)\\ -avg. file size & 62914560.000000 bytes (60.000000MiB)\\ -avg. time to split and encrypt a file & 3423.087539ms\\ -\end{tabular} -\end{center} - -\subsubsection{Differences from previous stress test (2015-11-08)} - -\begin{itemize} -\item Total time to process all files was slower by 1min2secs. -\item Average time to split and encrypt a file increased by - 399.87623299999996ms. -\end{itemize} - -\subsection{Tools used}\label{5-st-tu} - -The \verb+dump+ script\cite{program:dump} was used to dump files to -the combox directory at one-second intervals; a night of Emacs -Lisp indulgence made it possible to quickly slurp the required data -from the combox output and calculate the average time to split and -encrypt a file and the total amount of time taken to process the files -for a given dump\cite{program:dumps.el}; lastly, \verb+org-mode+ was -used to document all data gathered during stress -testing\cite{doc:benchmarks.org}. - -\subsection{Observations}\label{5-st-o} - -\begin{figure}[h] -\centering -\input{graphs/tot-time.tex} -\caption{time to process all files} -\label{fig:5-st-tt} -\end{figure} - -\begin{figure}[h] -\centering -\input{graphs/avg-time-sae.tex} -\caption{avg. time to split and encrypt} -\label{fig:5-st-atsae} -\end{figure} - - -\begin{itemize} -\item Figure \ref{fig:5-st-tt} shows the time it takes combox to - process files for a given file dump\footnote{A ``file dump'' here - means a bunch of files copied to the combox directory at 1 - sec intervals.}.
As can be observed from the graph, the total time - taken to process all the files tends to increase almost linearly with - the increase in the size of the file dump\footnote{The ``size of the - file dump'' is the total size of all files in a given file dump.}. -\item Figure \ref{fig:5-st-atsae} shows the average time it takes - combox to split and encrypt a file for a given file dump. There is a - steep increase in the average time from the \verb+424.798190MiB+ - dump to the \verb+1620.000000MiB+ dump, after which the average - time to split and encrypt a file seems to increase almost linearly; - the main reason for this is that the average file size is the same - (60MiB) for the dumps from \verb+1620.000000MiB+ to - \verb+10800.000000MiB+, whereas it is only 15.733266MiB for the - \verb+424.798190MiB+ dump. -\end{itemize} - -\begin{figure}[h] -\centering -\input{graphs/tot-time-diff.tex} -\caption{time to process all files - difference between 2015 and 2016} -\label{fig:5-st-tt-diff} -\end{figure} - -\begin{figure}[h] -\centering -\input{graphs/avg-time-sae-diff.tex} -\caption{avg. time to split and encrypt - difference between 2015 and 2016} -\label{fig:5-st-atsae-diff} -\end{figure} - -\begin{itemize} -\item Figure \ref{fig:5-st-tt-diff} shows the graphs for the total - amount of time taken to process all files for a given file dump in - the \verb+2016-01-16+ and \verb+2015-11-08+ stress tests. The amount - of time needed to process all files seems to be reduced for the - \verb+5940.000000MiB+ file dump when compared to the \verb+2015+ - stress test results and it seems to be slightly higher for the - \verb+10800.000000MiB+ file dump when compared to the \verb+2015+ - stress test. -\item Similarly, figure \ref{fig:5-st-atsae-diff} shows the graphs for - the average time to split and encrypt for a given file dump in the - \verb+2016-01-16+ and the \verb+2015-11-08+ stress tests.
The average - time taken seems to be almost the same for the - \verb+424.798190MiB+ and the \verb+1620.000000MiB+ dumps, but for the - \verb+5940.000000MiB+ and the \verb+10800.000000MiB+ dumps the - average time taken seems to be higher for the \verb+2016+ stress test - when compared to the \verb+2015+ stress test. -\end{itemize} - -\subsection{Issues found}\label{5-st-if} - -\begin{itemize} -\item Initially when combox was stress tested with huge files, combox - would get overwhelmed, leading to the computer running out of memory - and the load average sometimes peaking at \verb+8+. At first, it was - assumed that there was a bug in combox which caused this to happen, - but later it was found that \verb+watchdog+\cite{pylib:watchdog} was - generating a large number of ``file modified'' events when a huge file - (\verb+~500MiB+) was modified. To prevent \verb+watchdog+ from - generating a large number of ``file modified'' events for a single - modification of a huge file, a delay proportional to the size of the - file was introduced in the \verb+on_modified+ callback methods in both - \verb+ComboxDirMonitor+ and - \verb+NodeDirMonitor+\cite{git:bug-ten-fix}; this fixed the - issue. Also, it might be useful to note here that this was - ``the'' hardest issue I dealt with in working on combox. -\end{itemize}- \ No newline at end of file
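The size-proportional delay described under ``Issues found'' can be sketched roughly as follows. This is a minimal illustration in Python, not combox's actual code: the names `modification_delay`, `MIB_PER_SECOND_OF_DELAY` and `SizeAwareHandler`, and the rate of one second per 100 MiB, are assumptions made for the example. A real monitor would receive watchdog events in its `on_modified` callback; that dependency is left out here so the sketch runs with only the standard library.

```python
import os
import time

MIB = 1024 * 1024
# Hypothetical tuning constant: one second of delay per 100 MiB of file.
MIB_PER_SECOND_OF_DELAY = 100.0


def modification_delay(size_bytes):
    """Return a delay in seconds that grows linearly with the file size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return (size_bytes / MIB) / MIB_PER_SECOND_OF_DELAY


class SizeAwareHandler:
    """Stand-in for a directory monitor's event handler (hypothetical)."""

    def __init__(self, process, sleep=time.sleep, getsize=os.path.getsize):
        self._process = process    # called once the delay has elapsed
        self._sleep = sleep        # injectable so the sketch can be tested
        self._getsize = getsize    # injectable for the same reason

    def on_modified(self, path):
        # Wait in proportion to the file size before handling the event.
        self._sleep(modification_delay(self._getsize(path)))
        self._process(path)
```

With these assumed constants, a 500 MiB file would be held back for five seconds before it is processed, while small files pass through almost immediately; the actual proportionality factor used in combox is not stated in this chapter.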