Wednesday, August 5, 2015

SAP HANA - Preparing My Stock Market Data

1. Background

My original goal in setting up a SAP HANA One instance on AWS was to load some data and start playing with the analysis tools. I looked around but didn't find anything interesting. What I have done instead is build a database of historical market data that I can use to analyze stocks listed on the American stock exchanges. In this case, it is some older data available from Yahoo.

The process below outlines my steps for building that data set and importing it into a MariaDB database (an offshoot of MySQL). Luckily, I have used MySQL in the past and know, at least partly, my way around Python scripts, so preparing and loading the data didn't take too long.

In this first phase of data collection I wanted to accomplish these steps:
  • Download the data and compile into a directory on a Linux system (in this case a CentOS 7 system hosted on AWS).
  • Import that data into a MariaDB database (cleaning up where necessary) so it can be imported into, or accessed by, a HANA One system on AWS.

2. System Preparation

The first thing I did was to create some Python scripts based on what I found on Stack Overflow (the details are down below). I had to modify these since they were originally written for a Windows system. Then I had to install MariaDB, add some Python libraries, and create a directory to hold all of the data.

2.2 Installing MariaDB
CentOS 7 is based on Red Hat Enterprise Linux 7, so the install requirements are a little different. I followed the install steps on the MariaDB site.

Here is what I did:
- Run the command: 'yum install mariadb-server mariadb-client'
Yum pulled in all of the required dependencies automatically.

CentOS uses systemd to manage services so starting the services is handled differently from before (no more standard init.d directories). Here are the commands to enable and then start MariaDB:
systemctl enable mariadb
systemctl start mariadb

Once the database is up and running, the script 'mysql_secure_installation'
should be run to lock the system down and remove some demo data.

Here are the steps I followed:
- Run mysql_secure_installation as root (no password yet) and follow the prompts.

2.3 Preparing Python
On this CentOS system, Python was already installed, so I just needed to add a few libraries (one for MySQL, one for the Python package installer, and another for some web-scraping functionality). Here are the commands I used:
- yum install libmysqlclient-dev
- yum install MySQL-python 
- yum install python-pip (this is for the Python installer)
- yum install beautifulsoup4

2.4 Preparing Directories
For the data directories, I created subdirectories under the default centos account. All I did was create a stockdata subdirectory and then another subdirectory named history (to hold a batch of downloaded files).
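Those two directories can be created with a couple of mkdir commands, or with a short Python snippet like this one (the paths are my assumptions matching the layout described above):

```python
import os

# Paths assume the default 'centos' account home directory, as above
base_dir = os.path.expanduser("~/stockdata")
history_dir = os.path.join(base_dir, "history")

# Guard so the snippet can be re-run without errors
if not os.path.isdir(history_dir):
    os.makedirs(history_dir)  # creates stockdata and stockdata/history in one call
```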

3. Downloading the Data

The data comes from two places: an API at Yahoo and a listing of stocks on Wikipedia. The data from Yahoo is not very recent, but it should be enough to test the analysis functions in HANA.

The first data to retrieve are the listings from Wikipedia. For this I borrowed a script from Stack Overflow (a big thank you to user Nitay). It scrapes the Wikipedia page and pulls the list of stocks out of the table (the process took only about a minute).
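I won't reproduce the whole script here, but the core of that kind of scrape, using the beautifulsoup4 library installed earlier, looks roughly like this (the sample table and the first-column layout are my own stand-ins, not the original script's):

```python
from bs4 import BeautifulSoup

def parse_symbols(html):
    """Pull the ticker symbol out of the first data cell of each table row."""
    soup = BeautifulSoup(html, "html.parser")
    symbols = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if cells:  # header rows use <th>, so they have no <td> cells
            symbols.append(cells[0].get_text().strip())
    return symbols

# Quick check against a trimmed-down stand-in for the Wikipedia table
sample = """<table class="wikitable">
<tr><th>Symbol</th><th>Name</th></tr>
<tr><td>MMM</td><td>3M Company</td></tr>
<tr><td>ABT</td><td>Abbott Laboratories</td></tr>
</table>"""
print(parse_symbols(sample))  # ['MMM', 'ABT']
```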

I took the second script from Stack Overflow and updated it so I could pull the market data down from Yahoo and store it in the subdirectory created above (I updated the history_file location in the script). This process took about 20 minutes to complete.
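The download side boils down to building a URL per symbol and writing the response to a CSV file. A simplified sketch (the ichart URL is the old Yahoo endpoint scripts of that era used; it has since been retired, so treat this as illustrative only):

```python
import os

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as in the original scripts

HISTORY_DIR = os.path.expanduser("~/stockdata/history")

def history_url(symbol):
    """Old Yahoo historical-quotes endpoint (retired since this was written)."""
    return "http://ichart.finance.yahoo.com/table.csv?s=" + symbol

def history_file(symbol):
    """One CSV per symbol inside the history directory."""
    return os.path.join(HISTORY_DIR, symbol + ".csv")

def download(symbol):
    data = urlopen(history_url(symbol)).read()
    with open(history_file(symbol), "wb") as f:
        f.write(data)
```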

Each market symbol ends up as its own CSV file in the 'history' directory.

4. Loading MariaDB 

Now that I have the data, I need to load it into a database so it can be retrieved from HANA (or cleaned and transferred somewhere else).

I created the new database by writing and running a Python script (scripted so I can repeat the process on some other system if needed).
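A minimal version of such a script, using the MySQL-python driver installed earlier, might look like this (the database name and credentials are placeholders, not the originals):

```python
DB_NAME = "stockdata"  # placeholder; pick whatever name suits your setup

CREATE_DB_SQL = "CREATE DATABASE IF NOT EXISTS " + DB_NAME

def create_database(host="localhost", user="root", passwd=""):
    """Connect with the MySQL-python driver and create the database."""
    import MySQLdb  # from the MySQL-python package installed above
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd)
    try:
        conn.cursor().execute(CREATE_DB_SQL)
    finally:
        conn.close()
```

Call create_database() once with the credentials set during mysql_secure_installation.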

The next script is more complicated. It needs to create and populate two new tables. The part of the script shown here creates the two tables and then populates the symbol table using the data pulled down from Wikipedia.
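The schema itself isn't anything fancy: one table for the symbols scraped from Wikipedia and one for the daily history rows. The column layout below is my guess at a reasonable shape, not the script's exact DDL:

```python
# Hypothetical column layout; the real script's schema isn't shown here.
CREATE_SYMBOLS = """
CREATE TABLE IF NOT EXISTS symbols (
    symbol  VARCHAR(10) PRIMARY KEY,
    name    VARCHAR(100),
    sector  VARCHAR(50)
)"""

CREATE_SYMDATA = """
CREATE TABLE IF NOT EXISTS symdata (
    symbol   VARCHAR(10),
    trade_dt DATE,
    open     DECIMAL(12,4),
    high     DECIMAL(12,4),
    low      DECIMAL(12,4),
    close    DECIMAL(12,4),
    volume   BIGINT,
    PRIMARY KEY (symbol, trade_dt)
)"""

INSERT_SYMBOL = "INSERT INTO symbols (symbol, name, sector) VALUES (%s, %s, %s)"

def load_symbols(cursor, rows):
    """rows: (symbol, name, sector) tuples scraped from Wikipedia."""
    cursor.execute(CREATE_SYMBOLS)
    cursor.execute(CREATE_SYMDATA)
    cursor.executemany(INSERT_SYMBOL, rows)
```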

The second part of the Python script uses the symbol table created above as the basis for importing data from the history directory. It takes each record from that first table and generates a file name; if that CSV file exists, it steps through the file and populates the symdata table.
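Sketched out, that import loop looks something like this (the column order assumes the Date, Open, High, Low, Close, Volume layout of the old Yahoo CSV files, with any trailing adjusted-close column ignored; the table columns are my stand-ins):

```python
import csv
import os

HISTORY_DIR = os.path.expanduser("~/stockdata/history")

def csv_path(symbol):
    return os.path.join(HISTORY_DIR, symbol + ".csv")

def load_history(cursor, symbols):
    """For each symbol with a downloaded file, insert its rows into symdata."""
    insert = ("INSERT INTO symdata "
              "(symbol, trade_dt, open, high, low, close, volume) "
              "VALUES (%s, %s, %s, %s, %s, %s, %s)")
    for symbol in symbols:
        path = csv_path(symbol)
        if not os.path.exists(path):
            continue  # symbols retired by mergers and the like have no file
        with open(path) as f:
            reader = csv.reader(f)
            next(reader)  # skip the Date,Open,High,... header row
            for row in reader:
                cursor.execute(insert, [symbol] + row[:6])
```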

This took over an hour to run (I probably could have made it faster by not issuing a commit for each record), but I was able to pull almost all of the files into the database (quite a few symbols have changed with mergers and so on).
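That fix would have been a one-line change: commit once per file (or once at the end) instead of after every insert. Roughly (a hypothetical helper, not the original script):

```python
def load_rows(conn, cursor, rows, per_row_commit=False):
    """Insert history rows; committing per row is what made the original run slow."""
    insert = "INSERT INTO symdata VALUES (%s, %s, %s, %s, %s, %s, %s)"
    for row in rows:
        cursor.execute(insert, row)
        if per_row_commit:
            conn.commit()  # one transaction (and disk flush) per record
    if not per_row_commit:
        conn.commit()  # one transaction for the whole batch: much faster
```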

So I now have some nice centralized data to play with. My next step is to set up access to/from my SAP HANA One AWS test system and start playing with the data there.
