Data Mining

PHP CVS to MySQL

This PHP program is intended as a heavy duty automated tool for converting CSV (comma separated value) to a MySQL import file or querying it directly into a MySQL database. It can be used simply with a single static method call or with more flexibility and power by creating it as an object.

Download this CVS to MySQL from github.

It is not so trivial as one may first think to convert data from one format to another. To truly take advantage of MySQL querying of data in useful ways the data must be properly assigned a data type, such as INT, FLOAT, VARCHAR, TEXT, TIME and so on. The method implemented in this script to do this is REGEX. The regex pattern matching is leveraged to pigeon hole the data into the most appropriate type for MySQL to use. Every data entry must be scanned to detect if it is a certain type of data, a data type that maybe only integer numbers would of course become an INT type, but if even one of ten thousand entries has a decimal in it, then the entire set must become a float, double or numeric.

An INT simply converted to a VARCHAR is not very useful when trying to query data that should be an integer. The default regex rules file is “regex_mysql_data.txt” and has comment lines in it that start with the # character. You may want to go about modifying this file to fit your needs or to improve upon the matching capabilities.

Besides regex pattern matching, the data string length is also considered. This is done first to determine if that data should even be considered being compared to the regex pattern type. This is most useful for text based data to determine if it should be of types VARCHAR, TEXT, MEDIUMTEXT, LONGTEXT
Here is a simple static example to create a CSV to MySQL import.

CSVtoMySQL::ToHTML('test.csv');

All the static methods assume there is a header as the first row of the CSV file. This static method will try to detect a primary key, if it cannot determine a suitable primary key it will assign an INT at the beginning named ‘id’.

If you do not want to rely upon the auto detection of a primary key use this example:

CSVtoMySQL::ToHTMLMyKey ('test.csv', ‘MyID’);

Where “MyID” (optional) will become the name of the new primary key and no auto detection will be attempted.

-= THE STATIC METHODS =-

ToString – These static methods will display no output but only return the results as a string.

$string = CSVtoMySQL::ToString( $in_file [,$delim = ‘,’] )
$string = CSVtoMySQL::ToStringMyKey( $in_file [,$my_key = ‘id’ [,$delim = ‘,’]] )

ToFile – These methods will send the information to a file supplied as $out_file.
Null = CSVtoMySQL::ToFile( $in_file, $out_file, [$delim = ‘,’] )
Null = CSVtoMySQL::ToFileMyKey( $in_file, $out_file [,$my_key = ‘id’, [$delim = ‘,’]] )

ToScreen – These methods will print the mysql import information directly to the screen.

Null = CSVtoMySQL::ToScreen( $in_file [,$delim = ‘,’] )
Null = CSVtoMySQL::ToScreenMyKey( $in_file [,$my_key = ‘id’ [,$delim = ‘,’]] )

ToHTML – Like ToScreen methods but they will also add the HTML line break tag where the new line is.

Null = CSVtoMySQL::ToHTML( $in_file [,$delim = ‘,’] )
Null = CSVtoMySQL::ToHTMLMyKey( $in_file [,$my_key = ‘id’ [,$delim = ‘,’]] )

ToMySQL – These methods will use your mysql connection to send the mysql query directly to the database. You must have already connected to the mysql server and database before calling either of these methods.

Null = CSVtoMySQL::ToMySQL( $in_file [,$delim = ‘,’] )
Null = CSVtoMySQL::ToMySQLMyKey( $in_file [,$my_key = ‘id’ [,$delim = ‘,’]] )

-= CLASS USAGE =-

Creating a class object is more powerful then the static methods as there a lot of helper methods for fine tuning and debugging.

To create as an object:

$c2m = new CSVtoMySQL('test.csv');

//Then you can do something like:
$c2m->add_blank_tag('NA');
$c2m->add_blank_tag('M','PHONE');
$c2m->set_mysql_file(‘mymysql.sql’);
$c2m->detect_primary_key();
$c2m->to_file();

Here is another example where you export the CSV file and import directly to the mysql database.


<?php

require_once('CSVtoMySQL.php');

$sql = mysql_connect('xxx.xxx.xxx.xxx', 'user', 'password');
mysql_select_db('database',$sql);

$c2m = new CSVtoMySQL('test.csv');
$c2n->set_table_name(‘mytable’);
If($c2m->detect_primary_key() == false)
	{
	$c2m->add_primary_key(‘id’);
}
$c2m->to_mysql();

-= CLASS METHODS =-

The constructor:
__construct($csv, [$mysql = “mysql.sql” [,$hashead = true]])

This method loads the regex file and can be load a custom regex file.
Null = load_regex($regex_file = ”)

Reserved words are words that conflict with mysql syntax statements, such as VARCAR, INSERT, UPDATE, DATASE to prevent conflicts a rule file named “reserved_mysql_words.txt” is loaded and used to compare against the CSV header names. Any matches are renamed to prevent conflicts. You can override this file with your own using this method.
Null =load_reserved_words($f = ”)

Method to set the CSV file
Null = set_csv_file($file)

Method to set the path and name of the mysql output file, but only needed if actually creating an out file.
Null = set_mysql_file($file)

By default the CSV delimiter (data separator character) is comma “,” but there is an auto detect pass that will try and match with other common delimiters (such as |,tabs, spaces). If you need to set this manually use this method.
Null = set_delimiter($v)

Use this method to set the mysql table name, by default the table name is the name of the CSV file itself minus the extension.
Null = set_table_name($s)

When reading in CSV file line by line, the max length of each line is set to 0, which in PHP 5.1+ is unlimited to end of line. However, if you need to set this to a specific length use this method.
Null = set_max_line_length($v)

This method allows you to insert a new field that does not exist in the CSV file. $v is the name of the field, and the optional secondary value is the type which is defaulted to VARCHAR(255)
Null = add_field($v [,$type = ‘VARCHAR(255)’])

This method allows you to change the field name based on $n which can be an index number or name and $name is the new name to be given.
Bool = change_field_name($n,$name])

Use this method to set the primary key index. If $v is a number then the key is the field index, if a name it is matched against the header field name.
Bool = primary_key($v)

Like above method but only applies to the field name, not the index
Bool = primary_key_col_by_name($s)

Like above but only applies to setting the primary key by index, where the first field index = 0, not 1!
Bool = primary_key_col_by_number($n)

Add your own custom primary key with this method. This should be an INT as it will also be set to auto increment. Set the starting point of the auto increment public variable $user_primary_key_inc [ = 0]
Null = add_primary_key ([$name = ‘id’ [,$type = ‘INT’ [,$start_at = -1]]])

This method is used to try and detect which field in the CSV file should be used as the primary key. It begins with the first column and tries to match any INT or VARCHAR type that is all unique and contains no empty records. As soon as it finds one it sets that as the primary key. Also see notes in the “regex_mysql_data.txt” file. If $n is supplied it can either be a number which matches the index of the CSV column (where first column is 0, not 1) or the name of the actual column. This method retruns true if it was able to match a primary key, and false if it failed.
Bool = detect_primary_key($n = ”)

A helper method to test the types of fields detected
Null = print_types()

Same as above but outputs as HTML
Null = print_html_types()

The method to call for returning the results as a string.
String = to_string()

Send the output to the screen. I use it for when I am working in telnet or ssh
Null = to_screen()

Send the output like the to_screen() method but includes html breaks at the new line locations.
Null = to_html()

This method writes the output to a file, if you hadn’t already set the output file name you can supply it.
Bool = to_file([$file = ”])

This method sends the parsed CSV file directly to the MySQL database, you must have a connection already established (see usage above for an example.)
Bool = to_mysql()

Adds a blank tag identifier to the blank_tags array. Sometimes data will be in a CSV file that should be treated as if it were blank, such as with ‘NA’, ‘-‘, or the like. You can add global tag blanks with this method that cause this type of data to be ignored or treated as if it were empty. You can set the column field name here which apply the blank tag to just a specific column otherwise if blank it is treated globally against all columns.
Null = add_blank_tag($v [,$col = ”])

This method is ran automatically by several functions, but if you need to call it yourself you can. This method will attempt to determine the data type a column is using the “regex_mysql_data.txt” file and its rules.
Null = detect_types()

Used to try and detect if the CVS file contains a header. This is very problematic and not 100% accurate. By default the public variable $detect_header = false and must be set to true for this method to work. Otherwise it assumed there is a header. The method returns true if it detected a header and false if it did not.
Bool = detect_header($s)

-= ADDITIONAL CLASS HELPERS =-

CSVtoMySQL_DetectType is a class that is created and stored in the $regex_match_file array that contains the information from the “regex_mysql_data.txt” file.

CSVtoMySQL_FieldType is a class that is created and stored in the $fields array and contains information regard each CSV column and it’s fields.