Sunday, October 7, 2012

Logging Input/Output of Apache2

If you want to log all input received and all output sent by Apache to its error.log, you are in the right place. We will use "mod_dumpio", which logs Apache's input and output to its error log. Follow the steps below to set up the desired logging:

Enable mod_dumpio
To enable the mod_dumpio module for Apache, use the following command:
sudo a2enmod dump_io

Module configuration
The next step is to configure the module. To do this, open Apache's configuration file "apache2.conf":
sudo nano /etc/apache2/apache2.conf
Now add the following configuration options to the file:
DumpIOInput On
DumpIOOutput On
DumpIOLogLevel debug
DumpIOInput enables Apache's input logging, whereas DumpIOOutput enables its output logging. DumpIOLogLevel specifies the level at which the information is logged. You can find all available levels here --> http://httpd.apache.org/docs/2.2/mod/core.html#loglevel
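Since mod_dumpio only writes its entries when the server's overall log level is verbose enough, it can help to set everything in one place. A minimal sketch of the relevant apache2.conf block (for Apache 2.2):

# mod_dumpio entries are emitted at debug level, so LogLevel must allow it
LogLevel debug
DumpIOInput On
DumpIOOutput On
DumpIOLogLevel debug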

Restart Apache
Now restart Apache using one of the following commands:
sudo service apache2 restart
or 
sudo /etc/init.d/apache2 restart
Now you can open the Apache error log, or tail it to watch the input/output entries as they arrive. Use the following command to tail the log:
sudo tail -f /var/log/apache2/error.log

Note:
If for some reason you do not see Apache's input/output entries in the error.log file, check whether another config file overrides the log level. For example, in my case I had a virtual host enabled under Apache's sites-enabled directory, located at "/etc/apache2/sites-enabled/xyz" (where xyz is your site's name). Open this file and check whether the LogLevel it sets differs from the DumpIOLogLevel you just specified for mod_dumpio. The values should match, or you won't see any logging.
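For example, a minimal virtual host sketch (the server name and paths here are placeholders):

<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example
    # Must be at least as verbose as DumpIOLogLevel, or the dumpio entries are suppressed
    LogLevel debug
    ErrorLog /var/log/apache2/error.log
</VirtualHost>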

Wednesday, July 25, 2012

Installing Percona Server 5.5 on Ubuntu 10.04 Lucid

Installation

Debian and Ubuntu packages from Percona are signed with a key. So before using the repository, you should add the key to apt. To do that, run the following commands:
gpg --keyserver hkp://keys.gnupg.net --recv-keys 1C4CBDCDCD2EFD2A
gpg -a --export CD2EFD2A | sudo apt-key add -

Add the following lines to '/etc/apt/sources.list':
deb http://repo.percona.com/apt lucid main
deb-src http://repo.percona.com/apt lucid main


If you are using a different release, substitute 'lucid' with your release's codename. To find your release's codename, run the following command:
cat /etc/*-release

This command shows your distribution information; your release codename is the value of 'DISTRIB_CODENAME'.
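On Ubuntu, if the lsb-release package is installed, you can also print just the codename:

lsb_release -sc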

Now update the local cache:
sudo apt-get update

To install Percona Server 5.5 use the command:
sudo apt-get install percona-server-server-5.5

Kudos! Percona Server 5.5 will now be installed.

Note: Percona Server 5.5 does not come with a default configuration file (i.e. my.cnf). To figure out where you can put the my.cnf file, you need to run the following commands:
which mysqld --> /usr/sbin/mysqld
/usr/sbin/mysqld --verbose --help | grep -A 1 'Default options'

The output will contain a line like 'Default options are read from the following files in the given order: /etc/my.cnf /etc/mysql/my.cnf ...'. You can put the 'my.cnf' file in any of these locations, but make sure there is no config file in a directory that is looked up before the one you choose, or your my.cnf file will not be read. I personally prefer putting the 'my.cnf' file in '/etc/mysql/my.cnf'.

To prevent MySQL from starting automatically when the server boots, run the following command:
sudo update-rc.d -f mysql remove 
You will now need to start MySQL manually when you start or reboot your server.
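If you later change your mind and want MySQL to start at boot again, the init script links can be restored with:

sudo update-rc.d mysql defaults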

Some post-installation notes and points

Using a custom data directory and log directory for MySQL:

I generally put MySQL data files on a RAID-10 array and MySQL log files on a logical volume (no RAID configuration), so I explicitly set the data and log directories in the 'my.cnf' file. Below is how I specify the MySQL data directory and the MySQL log file directories:

Under  [mysqld] 
# data file directory
datadir = /var/mysql-data 
#log file directories
log_error = /var/mysql-logs/mysql-error.log
slow_query_log = 1
slow_query_log_file = /var/mysql-logs/mysql-slow.log
general_log = 1
general_log_file =  /var/mysql-logs/mysql-query.log  

If you set things up like this, there are a few things to keep in mind.
If you change 'datadir' to a location other than MySQL's default data directory, you need to move the 'mysql' and 'performance_schema' folders from '/var/lib/mysql/' to your new data directory. You can use the following commands:
sudo mv /var/lib/mysql/mysql /var/mysql-data
sudo mv /var/lib/mysql/performance_schema /var/mysql-data

Note: Make sure the folder '/var/mysql-data' (or whatever folder you are using as the MySQL data folder) has the right ownership: the owner should be 'mysql' and the group should be 'mysql' too. You can change the ownership using the following command:
sudo chown mysql:mysql /var/mysql-data
The same instructions apply to the MySQL log folder, in case you decide to use a different folder to store the log files.

Using a custom location to store 'socket' and 'pid_file'

Suppose you want to specify a custom location to store MySQL's socket and pid files, as in the configuration below:

Under [mysqld]
socket = /path/to/mysql/mysql.sock
pid_file =  /path/to/mysql/mysql.pid
Under [client]
socket = /path/to/mysql/mysql.sock

Make sure you have the right ownership (mysql:mysql) on the folder containing both files. Secondly, you also need to edit the 'debian.cnf' file, which you can find at '/etc/mysql/debian.cnf': open it and change the 'socket' entries to the location you set in 'my.cnf'. If you do not, MySQL will fail to start and stop properly, because Debian's maintenance scripts read this file.
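For illustration, the relevant parts of 'debian.cnf' look roughly like this (the password is generated at install time; only the socket lines need to change):

[client]
host     = localhost
user     = debian-sys-maint
password = <generated password>
socket   = /path/to/mysql/mysql.sock
[mysql_upgrade]
host     = localhost
user     = debian-sys-maint
password = <generated password>
socket   = /path/to/mysql/mysql.sock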

Start/Stop/Restart MySQL

To start/stop/restart MySQL, use one of the following commands:
service mysql [start or stop or restart]
or 
/etc/init.d/mysql [start or stop or restart] 

Monday, May 28, 2012

Running Ubuntu Server in full screen mode - VirtualBox

I wanted to run Ubuntu Server in full screen mode, or at least increase its resolution, while running server instances on VirtualBox. Below are the steps I took to change the resolution:

  • Open the "grub.cfg" file. You can find it at "/boot/grub/grub.cfg":
sudo vim /boot/grub/grub.cfg
  • Change the following assignments:
Change "set gfxmode=640x480" to "set gfxmode=1024x768"
Change "set gfxpayload=640x480" to "set gfxpayload=1024x768"
  • Reboot your Ubuntu Server

Note: You can change the resolution from 1024x768 to any desired resolution; I only used 1024x768 as an example.
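Also keep in mind that "grub.cfg" is regenerated whenever update-grub runs, so direct edits can be overwritten. A more durable sketch, assuming a standard GRUB 2 setup, is to set the mode in "/etc/default/grub" and regenerate the config:

# in /etc/default/grub
GRUB_GFXMODE=1024x768
GRUB_GFXPAYLOAD_LINUX=keep

Then run:
sudo update-grub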

Wednesday, May 9, 2012

Amazon EC2 & Scalr – Roles, Instances, Regions, Availability Zones & ELB


In this post I will briefly describe some terms that you will frequently encounter when deploying an application on Amazon EC2 and when using Scalr for application management on EC2. I will then describe how these objects work together in harmony. To gain a better understanding of how they interact, we first need to understand what each of them means.

Role

A role is a machine image and, as the name indicates, it serves a specific function of an application in the cloud. Typically a role is an abstraction of an instance (defined next); a role defines a template consisting of the set of installations needed to fulfil a specific function of an application. For example, a typical application will have web servers, cache servers and data servers, and all three can be roles (an Apache2 + PHP + APC role, a Memcached role and a MySQL role). Roles are generally assigned to a farm (a set of instances working together to accomplish a task) and have their own security groups. For example, an application role can be made public, while caching and database roles should be kept private and internal to the network.

Instances

Unlike a role, which has no physical existence, an instance is the physical realization of a role. There can be multiple instances running for a particular role: roles are templates, and instances are actual implementations of those templates.

Regions

The Amazon EC2 infrastructure is spread across the globe in different regions. These regions are geographically separated, which makes it possible to run an application in several regions and thus make it fault tolerant; an application can also serve clients from the closest region. Regions are completely isolated from each other. The following regions are available in Amazon EC2:
US East (Northern Virginia)
US West (Oregon)
US West (Northern California)
EU (Ireland)
Asia Pacific (Singapore)
Asia Pacific (Tokyo)
South America (Sao Paulo)

Availability Zones

Availability Zones are locations within a region where instances can run. They help make the instances in a region resilient to failure. We can run a region's instances in one or more Availability Zones, or distribute the instances equally among them. Availability Zones inside a region are connected to each other.

Elastic Load Balancer

The Elastic Load Balancer (ELB), as the name indicates, distributes incoming traffic among instances in one or many Availability Zones. The ELB also checks for unhealthy instances in an Availability Zone and routes incoming traffic to healthy instances. It supports sticky sessions and can terminate SSL at the balancer level, so the application servers do not need to perform SSL decryption. When you launch an ELB in a region, make sure it routes traffic to the Availability Zones that carry instances: by default an ELB will distribute traffic among all Availability Zones inside a region, so be sure to select only the zones that carry instances, otherwise the application will face timeouts.
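As a rough sketch of the zone-selection advice above, here is how a load balancer could be created with the AWS CLI, attaching only the zones that carry instances (this assumes the AWS CLI is installed and configured; the balancer name and zones are placeholders):

aws elb create-load-balancer \
    --load-balancer-name my-app-lb \
    --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
    --availability-zones us-east-1a us-east-1b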

Sunday, March 18, 2012

Installing Sphinx 2.0.4 on Ubuntu 10.04 - Lucid

This blog post will help you install Sphinx 2.0.4 on Ubuntu 10.04. 

About Sphinx
Sphinx is a distributed search engine for full-text search. MySQL offers its own full-text search through the MyISAM storage engine, but it is not easy to scale. Sphinx has many other advantages, such as:
  • better indexing and searching speed 
  • good relevance search
  • and most importantly better scalability

Sphinx has two parts:
indexer - This part indexes the data source by pulling information from it and building indexes.
searchd - This part serves search queries by looking up the indexes created by the indexer.
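Once a configuration file is in place (see the configuration section below), the two parts are typically driven like this, assuming the default install paths used later in this post:

/usr/local/bin/indexer --config /usr/local/etc/sphinx.conf --all
/usr/local/bin/searchd --config /usr/local/etc/sphinx.conf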

Installing Sphinx
The following steps will help you install Sphinx successfully on your Ubuntu box:
 
First, check whether the dependencies are already installed; if not, install them:
sudo apt-get install libmysql++-dev libmysqlclient15-dev checkinstall
 
Download Sphinx 2.0.4 in '/tmp'
cd /tmp
sudo wget http://sphinxsearch.com/files/sphinx-2.0.4-release.tar.gz 
 
Unpack the 'tar.gz' file and install Sphinx 
sudo tar -xzf sphinx-2.0.4-release.tar.gz
The above command unpacks the 'tar.gz' file; you will find the contents in the 'sphinx-2.0.4-release' directory.
cd sphinx-2.0.4-release
 
Make install 
sudo ./configure
sudo make
sudo checkinstall

Note: You will be prompted to create a directory and set a description for the package. It will also ask some questions with default answers; fill them in as you see fit. I had an error during package installation: on checking the log file I saw that specifying a version was mandatory, so I changed the version to 2.0.4.

After installation, the package will be saved to: 
/tmp/sphinx-2.0.4-release/sphinx-2.0.4_2.0.4-1_i386.deb

Make a new folder to keep the *.deb package:
sudo mkdir /home/[YOUR_USERNAME]/SphinxInstalls

Move the sphinx-2.0.4_2.0.4-1_i386.deb package from '/tmp/sphinx-2.0.4-release/sphinx-2.0.4_2.0.4-1_i386.deb' to '/home/[YOUR_USERNAME]/SphinxInstalls' 
sudo mv /tmp/sphinx-2.0.4-release/sphinx-2.0.4_2.0.4-1_i386.deb /home/[YOUR_USERNAME]/SphinxInstalls

You can now delete the working folder and the tar.gz file : 
sudo rm -r /tmp/sphinx-2.0.4-release
sudo rm /tmp/sphinx-2.0.4-release.tar.gz

Location of Configuration/Daemons/Documentation
You can find Sphinx Documentation in:
/usr/share/doc/sphinx-2.0.4
 
Sphinx Configurations are found in '/usr/local/etc/'. The configuration files that exists by default are:
example.sql
sphinx.conf.dist
sphinx-min.conf.dist
Note: The default Sphinx configuration file is sphinx.conf, but it is not created by default, so copy 'sphinx.conf.dist' to 'sphinx.conf':
sudo cp sphinx.conf.dist sphinx.conf 
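As a minimal sketch of what goes into sphinx.conf (the MySQL credentials, table and paths below are hypothetical), a single-index setup looks roughly like this:

source src_main
{
    type      = mysql
    sql_host  = localhost
    sql_user  = root
    sql_pass  =
    sql_db    = test
    sql_query = SELECT id, title, content FROM documents
}

index idx_main
{
    source = src_main
    path   = /var/data/idx_main
}

searchd
{
    listen   = 9312
    log      = /var/log/searchd.log
    pid_file = /var/run/searchd.pid
}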
 
Sphinx Processes (indexer, searchd etc.) are found in '/usr/local/bin/'.

To remove Sphinx: sudo dpkg -r sphinx-2.0.4
To install again use the package in '/home/[YOUR_USERNAME]/SphinxInstalls': sudo dpkg -i sphinx-2.0.4_2.0.4-1_i386.deb 

You are done with the installation of Sphinx on your Ubuntu box. In my upcoming posts, coming this weekend, I will cover Sphinx configuration (configuring local indexes and distributed instances).


Monday, March 12, 2012

Enabling mod_rewrite Module in Apache2

I came across this situation recently and thought of publishing it on my blog so others can find it useful.

Introduction: 
'mod_rewrite' is a module in Apache that provides a rule-based rewriting engine to rewrite requested URLs. You can read more about it at Apache Module mod_rewrite.

Enabling mod_rewrite:
To enable mod_rewrite, use the following command:

sudo a2enmod rewrite

Now restart Apache:
sudo service apache2 restart (or) sudo /etc/init.d/apache2 restart
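To see the module in action, you can place a small rule set in an '.htaccess' file inside your document root (the URL pattern and target script below are hypothetical, and AllowOverride must permit overrides for that directory):

RewriteEngine On
# Rewrite a clean URL like /article/123 to /article.php?id=123
RewriteRule ^article/([0-9]+)$ article.php?id=$1 [L,QSA]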

Sunday, January 22, 2012

MySQL InnoDB Indexes


A few days back I was reading about MySQL indexes, more specifically InnoDB indexes, to better understand query performance and optimization, so I thought I would share some of what I learned. Indexes are structures that help the database engine find (retrieve) records faster. The opposite of an index lookup is a full scan: think of a full table scan as going through all the rows in a table to select the right ones.
A common example used to explain indexes is a book's index. To look up a topic in a book, you either find the topic in the index or scan the whole book page by page. If the book has few pages, it is viable to go page by page, but if the book has a decent number of pages, then using the index is the smarter and more efficient approach. The same holds for database indexes.

InnoDB Indexes
InnoDB stores its indexes in a B+ tree structure. B+ trees are a topic of their own; I will write a separate post on how data in a B+ tree is organized and how inserts, updates and deletes affect the tree.

Clustered Index
A clustered index is an approach to storing data. Think of a clustered index as a tree structure (index) with the data rows as leaves and the primary key values as the nodes above the leaves. InnoDB clusters the data by primary key. Below are some points to remember regarding the clustered index:
  • As stated earlier, InnoDB clusters the data by primary key. If the table has a primary key, MySQL uses it as the clustered index.
  • If the table does not have a primary key, MySQL selects the first UNIQUE NOT NULL index as the clustered index.
  • If neither of the above applies, MySQL generates a hidden (6-byte) field containing a row ID and uses it to cluster the data (as the clustered index).
Because the clustered index holds both the data and the primary key on the same page, row access is fast: no additional disk I/O is needed. With MyISAM, by contrast, an additional disk I/O is needed because the index and the data are not on the same page. Some further points worth mentioning:
  • Insertion speed depends on how data is inserted into the table. Insertions are fast if rows are inserted in primary key order; inserting rows with random keys is a bad approach, whereas sequential keys (such as AUTO_INCREMENT) are a good one.
  • Updating the primary key is usually a bad idea, as it forces each updated row to be moved to a different location. Moving rows to new locations may lead to page splits, which cause the table to use more disk space.
  • Though clustered indexes are efficient for retrieval, defining a clustered key with many columns can be a disadvantage; the reason will become clear once you read about secondary indexes.
Secondary Indexes
Secondary indexes are also called non-clustered indexes. Unlike the clustered index, their leaves do not store row data; they store primary key values (the clustered key). This is why it is advised to keep the primary key short: its size affects the size of every secondary index. As for lookups, a secondary index lookup takes two steps: first the primary keys matching the secondary key are fetched, and then the actual rows are fetched from the clustered index using those primary keys.
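As a hypothetical illustration (the table and columns are made up), here is how the two index types line up in an InnoDB table:

CREATE TABLE users (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    email VARCHAR(100) NOT NULL,
    name  VARCHAR(50)  NOT NULL,
    PRIMARY KEY (id),        -- clustered index: leaf pages hold the full rows
    KEY idx_email (email)    -- secondary index: leaf entries hold (email, id)
) ENGINE=InnoDB;

-- Two-step lookup: find matching ids in idx_email,
-- then fetch the rows from the clustered index by id
SELECT name FROM users WHERE email = 'someone@example.com';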

Tuesday, January 3, 2012

Architecture for Scaleable Resource Discovery (Part II)

This is the second part of this post. In Part I, I explained the problem, the analysis, and the architecture and algorithms devised to solve the presented problem. In this part I present a strategy for testing the architecture on Amazon EC2: implementing the architecture with EC2, and then building testing clusters that simulate thousands of requests per second against it.

Simulation Strategy
Region Design
In order to simulate our algorithm, we chose Amazon's Elastic Compute Cloud (EC2) as the base framework for simulating regional data replication and distribution. Amazon's EC2 service is already broken into seven regions (each containing multiple Availability Zones), which allows us to partially test our geographically distributed Regional Minimum Spanning Tree building algorithm. Distributing data accesses across multiple servers within a cluster or ring can be handled by the Elastic Load Balancing (ELB) feature, which also detects unhealthy instances of the data and reroutes their traffic to healthy instances until the unhealthy instances can be restored. The CloudWatch feature lets us monitor just how efficiently our design performs. The following diagram shows our planned usage of the Amazon EC2 platform to simulate our regional architecture:


A resource request generated by a client is routed to an appropriate Resource Region (hosted in one Amazon EC2 region); Amazon automatically routes the request to the region closest to the client's location. The request first lands on the Elastic Load Balancer (ELB) of the Resource Region, which forwards it to one of the region servers according to the ELB's request distribution algorithm. The region servers are auto-scaled, meaning the number of servers grows and shrinks as the volume of resource requests increases or decreases; the auto-scaling criteria can be bandwidth, requests per second, idle CPU time, and so on.

Once the request reaches a region server, the server determines (based on the resource type) which cluster should handle the request and forwards it to that cluster's ELB, which routes it to one of the cluster servers. The cluster servers are auto-scaled just like the region servers. The cluster server then uses consistent hashing to figure out which ring cluster the request should be forwarded to, and sends it to that ring cluster's ELB, which forwards it to any MySQL ring server. Each MySQL ring server runs a web service that takes the resource ID, looks the resource up in the database, and returns the appropriate response (resourceLocation or notFound). Unlike the region and cluster servers, the ring servers are not auto-scaled.

For ease of deployment and re-use, we will use server templates for the Region Server, Cluster Server and Ring Server. The templates contain all of the necessary server configuration, so adding another pre-configured server is just a matter of spawning another instance of a template. We can also clone a Ring Cluster (Farm Cloning) to create a new Ring Cluster when we need more rings, in the case of new resources being added to the system.

Security
All of the servers will be closed to public access. Only for the purpose of region synchronization will the cluster servers be allowed to connect to the cluster servers of the neighboring region.

Testing Clusters
Now that the regional architecture is set up, we also need to simulate millions of users accessing the data concurrently across all seven regions of the EC2 platform. The following diagram shows how we plan to test the regional architecture described above:


We will create a separate set of EC2 instances that send requests to our regional architecture at a predefined rate; these will be known as the Tester instances. The Tester instances will take advantage of predefined tools and use our Tester template to spawn multiple testers per region, which allows us to scale up the number of requests and analyze the network traffic and bandwidth used. This approach lets us generate as many or as few requests per second as we want for testing purposes.