Shell Scripting: Reading a File Line by Line with while read

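Hello, this is Hosoe (細江) from the Technology Division at Eight Hundred Inc. (株式会社エイトハンドレッド). In this article I show how to combine the shell's while statement with the read command to read the contents of a file one line at a time.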

Syntax

while read ★variable name★
do
  commands to run
done < name of the file to read line by line

# To put while ... do on a single line, write a ; before do
while read ★variable name★; do
  commands to run
done < name of the file to read line by line
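
As a concrete instance of the syntax above, here is a minimal sketch that numbers the lines of a small sample file. The function name number_lines and the sample contents are hypothetical, chosen only for illustration. It also uses IFS= and the -r option of read, which I assume you will usually want so that leading whitespace and backslashes in the data are kept as-is.

```shell
#!/bin/sh
# number_lines FILE: print each line of FILE prefixed with its line number.
# (number_lines and the sample file below are hypothetical names for this sketch.)
number_lines() {
  n=0
  # IFS= preserves leading/trailing whitespace; -r keeps backslashes literal.
  while IFS= read -r line; do
    n=$((n + 1))
    printf '%s: %s\n' "$n" "$line"
  done < "$1"
}

# Build a throwaway sample file, read it line by line, then clean up.
sample_file=$(mktemp)
printf '%s\n' 'first line' '  indented line' 'path\to\file' > "$sample_file"
number_lines "$sample_file"
rm -f "$sample_file"
```

The redirect after done feeds the whole loop, so read consumes one line of the file on each iteration and the loop ends when read reaches end of file.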

Example

First, we prepare a file for while read to read line by line. For this article, I used the AWS CLI to create a log file of Amazon Athena query information.

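for ExecutionId in $(aws athena list-query-executions | jq -r '.QueryExecutionIds[]'); do
  QueryExecution=$(aws athena get-query-execution --query-execution-id "$ExecutionId")
  # $QueryExecution is deliberately left unquoted here: word splitting collapses
  # the multi-line JSON (the CLI's default output) into one line per record
  echo $QueryExecution >> athena_query_executions.log
done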

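Running the shell script above produces a log file like the one below, with the result of each get-query-execution command stored on its own line.
Note: the actual command output has been simplified for the purposes of this article¹.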

{ "QueryExecution": { "QueryExecutionId": "800-eight-hundred", "QueryExecutionContext": { "Database": "Eight" }, "ResultConfiguration": { "OutputLocation": "s3://aws-athena-query-results-800/eight.txt" }, "Status": { "State": "RUNNING", "SubmissionDateTime": "2800-08-28T00:00:08+00:00" } } }
{ "QueryExecution": { "QueryExecutionId": "080-eight-hundred", "QueryExecutionContext": { "Database": "Hund" }, "ResultConfiguration": { "OutputLocation": "s3://aws-athena-query-results-080/hund.txt" }, "Status": { "State": "SUCCEEDED", "SubmissionDateTime": "2800-08-18T00:08:08+00:00" } } }
{ "QueryExecution": { "QueryExecutionId": "008-eight-hundred", "QueryExecutionContext": { "Database": "Red" }, "ResultConfiguration": { "OutputLocation": "s3://aws-athena-query-results-008/red.txt" }, "Status": { "State": "FAILED", "SubmissionDateTime": "2800-08-08T08:08:08+00:00" } } }

From this file, athena_query_executions.log, we use while read to produce a CSV file.

# Write the header row of the output CSV file
echo '"QueryId","Database","OutputLocation","Status","DateTime"' > athena_query_executions.csv

# Read the data one line at a time with while read
# and extract each element with the jq command
# (-r keeps backslashes literal; quoting "$line" prevents glob expansion)
while IFS= read -r line; do
  QueryId=$(printf '%s\n' "$line" | jq '.QueryExecution.QueryExecutionId')
  Database=$(printf '%s\n' "$line" | jq '.QueryExecution.QueryExecutionContext.Database')
  OutputLocation=$(printf '%s\n' "$line" | jq '.QueryExecution.ResultConfiguration.OutputLocation')
  Status=$(printf '%s\n' "$line" | jq '.QueryExecution.Status.State')
  DateTime=$(printf '%s\n' "$line" | jq '.QueryExecution.Status.SubmissionDateTime')
  echo "$QueryId,$Database,$OutputLocation,$Status,$DateTime" >> athena_query_executions.csv
done < athena_query_executions.log

With that, we have a list of Amazon Athena query information.

QueryId	Database	OutputLocation	Status	DateTime
800-eight-hundred	Eight	s3://aws-athena-query-results-800/eight.txt	RUNNING	2800-08-28T00:00:08+00:00
080-eight-hundred	Hund	s3://aws-athena-query-results-080/hund.txt	SUCCEEDED	2800-08-18T00:08:08+00:00
008-eight-hundred	Red	s3://aws-athena-query-results-008/red.txt	FAILED	2800-08-08T08:08:08+00:00
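
As the footnote to this article mentions, real query text can contain backslash escapes. The following is a minimal sketch of what read does with them, using a made-up one-line JSON record (the record and the variable names plain and raw are hypothetical). It compares read without options against IFS= read -r.

```shell
#!/bin/sh
sample_file=$(mktemp)
# A hypothetical record whose string value contains escaped double quotes.
printf '%s\n' '{"Query": "SELECT \"id\" FROM t"}' > "$sample_file"

# Without -r, read treats each backslash as an escape character and removes it.
while read line; do
  plain=$line
done < "$sample_file"

# With IFS= and -r, the line is stored exactly as it appears in the file.
while IFS= read -r line; do
  raw=$line
done < "$sample_file"

# Always quote the expansion when printing; an unquoted $line would also be
# subject to word splitting and glob expansion (for example of a lone *).
printf 'read:    %s\n' "$plain"
printf 'read -r: %s\n' "$raw"
rm -f "$sample_file"
```

The first line printed has lost its backslashes and is no longer valid JSON, while the second is unchanged, which is why the -r form is the safer default when the lines are passed on to jq.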

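Reading Multiple Files Line by Line with a for Loop

for ★variable name 1★ in multiple file names (for example, using wildcards); do
  while read ★variable name 2★; do
    commands to run
  done < "$★variable name 1★"
done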

Suppose that, in addition to athena_query_executions.log from the example above, the current directory also contains the following log file named athena_query_executions2.log.

{ "QueryExecution": { "QueryExecutionId": "eight-hundred-800", "QueryExecutionContext": { "Database": "Eight" }, "ResultConfiguration": { "OutputLocation": "s3://aws-athena-query-results-eight/800.txt" }, "Status": { "State": "FAILED", "SubmissionDateTime": "2800-08-08T00:00:08+00:00" } } }
{ "QueryExecution": { "QueryExecutionId": "eight-hundred-080", "QueryExecutionContext": { "Database": "Hund" }, "ResultConfiguration": { "OutputLocation": "s3://aws-athena-query-results-hund/080.txt" }, "Status": { "State": "FAILED", "SubmissionDateTime": "2800-08-18T00:08:08+00:00" } } }
{ "QueryExecution": { "QueryExecutionId": "eight-hundred-008", "QueryExecutionContext": { "Database": "Red" }, "ResultConfiguration": { "OutputLocation": "s3://aws-athena-query-results-red/008.txt" }, "Status": { "State": "SUCCEEDED", "SubmissionDateTime": "2800-08-28T08:08:08+00:00" } } }

From the two log files, athena_query_executions.log and athena_query_executions2.log, we extract only the records whose Status is FAILED.

for file in athena_query_executions*.log; do
  # -r keeps backslashes literal; quoting "$line" and "$file" prevents
  # word splitting and glob expansion
  while IFS= read -r line; do
    QueryId=$(printf '%s\n' "$line" | jq '.QueryExecution.QueryExecutionId')
    Database=$(printf '%s\n' "$line" | jq '.QueryExecution.QueryExecutionContext.Database')
    OutputLocation=$(printf '%s\n' "$line" | jq '.QueryExecution.ResultConfiguration.OutputLocation')
    Status=$(printf '%s\n' "$line" | jq '.QueryExecution.Status.State')
    DateTime=$(printf '%s\n' "$line" | jq '.QueryExecution.Status.SubmissionDateTime')
    if [ "$Status" = '"FAILED"' ]; then
      echo "$QueryId,$Database,$OutputLocation,$Status,$DateTime"
    fi
  done < "$file"
done

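The for loop iterates over the log files in the current directory, and while read then reads each file's data one line at a time. The if statement outputs only the records whose Status is FAILED. Because jq is run without -r, its output keeps the JSON double quotes, which is why the comparison string is '"FAILED"'. The order in which the files are processed follows the shell's sort order for the wildcard match, which depends on the locale.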

"eight-hundred-800","Eight","s3://aws-athena-query-results-eight/800.txt","FAILED","2800-08-08T00:00:08+00:00"
"eight-hundred-080","Hund","s3://aws-athena-query-results-hund/080.txt","FAILED","2800-08-18T00:08:08+00:00"
"008-eight-hundred","Red","s3://aws-athena-query-results-008/red.txt","FAILED","2800-08-08T08:08:08+00:00"
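
The same nested for / while read structure can be tried without jq or any AWS data. The sketch below counts, per file, the lines that contain the text "FAILED". The function name count_failed and the sample file names are hypothetical, and the substring match with case is a simplification standing in for the jq extraction above.

```shell
#!/bin/sh
# count_failed FILE...: for each file, print "FILE: N", where N is the number
# of lines containing the literal text "FAILED" (including its JSON quotes).
# (count_failed and the sample file names below are hypothetical.)
count_failed() {
  for file in "$@"; do
    # If a wildcard matched nothing, the pattern itself is passed through; skip it.
    [ -f "$file" ] || continue
    count=0
    while IFS= read -r line; do
      case $line in
        *'"FAILED"'*) count=$((count + 1)) ;;
      esac
    done < "$file"   # the redirect feeds the inner while loop, once per file
    printf '%s: %s\n' "$file" "$count"
  done
}

# Two small sample logs in a throwaway directory.
work_dir=$(mktemp -d)
printf '%s\n' '{"State": "FAILED"}' '{"State": "SUCCEEDED"}' > "$work_dir/sample1.log"
printf '%s\n' '{"State": "FAILED"}' '{"State": "FAILED"}' > "$work_dir/sample2.log"
(cd "$work_dir" && count_failed sample*.log)
rm -rf "$work_dir"
```

With these sample files the sketch reports one matching line for sample1.log and two for sample2.log. The point to notice is that the input redirect sits on the inner done, so each pass of the for loop opens a different file.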

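Reading a Command's Output Line by Line

command whose output is to be read | while read ★variable name★; do
  commands to run
done

Let's extract only columns 1, 2, and 4 from the athena_query_executions.csv file created in the example above.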

awk -F ',' '{print $1,$2,$4}' athena_query_executions.csv | while IFS= read -r line; do
  printf '%s\n' "$line"
done

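awk
A command for text editing operations such as searching, extracting, and transforming text.
Here -F specifies the field separator (,), and the print action, which outputs text, extracts columns 1, 2, and 4 of athena_query_executions.csv. Fields separated by commas in print are joined with awk's output field separator, a single space by default, which is why the output below is space-separated.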

"QueryId" "Database" "Status"
"800-eight-hundred" "Eight" "RUNNING"
"080-eight-hundred" "Hund" "SUCCEEDED"
"008-eight-hundred" "Red" "FAILED"
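
One thing to watch for with the pipe form: in bash with default settings, each part of a pipeline runs in a subshell, so variables set inside the while loop are not visible after done. The sketch below illustrates this and one possible workaround, wrapping the loop in a function that reads standard input and prints its result. The names emit_rows and sum_second_column and the sample rows are hypothetical.

```shell
#!/bin/sh
# emit_rows stands in for any command whose output is read line by line.
# (emit_rows and sum_second_column are hypothetical names for this sketch.)
emit_rows() {
  printf '%s\n' 'a,1' 'b,2' 'c,3'
}

total=0
emit_rows | while IFS=, read -r name value; do
  total=$((total + value))
done
# In bash with default settings (and in dash) the loop above runs in a
# subshell, so this prints 0; some other shells, such as zsh, print 6.
printf 'after pipeline: %s\n' "$total"

# Wrapping the loop in a function that reads stdin and prints its result
# gives the same answer regardless of how the shell runs the pipeline.
sum_second_column() {
  sum=0
  # IFS=, applies only to this read and splits each line at the comma.
  while IFS=, read -r name value; do
    sum=$((sum + value))
  done
  printf '%s\n' "$sum"
}

emit_rows | sum_second_column
```

When the loop only prints, as in the awk example above, the subshell makes no difference. It matters once you want to accumulate a count or a total across lines.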

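Closing Thoughts

This article grew out of wanting to check and analyze our own company's Amazon Athena usage, as in the example shown above.

ETL tools have become quite capable recently, but AWS CloudShell now makes it easy to get a Linux environment, and for data preprocessing at this level I think a shell script is the more efficient choice.

I hope this article helps people who work in data analysis but are not used to coding to make use of shell scripts!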

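¹ With real data, escape characters inside the Query element or the asterisk in SELECT * can lead to unintended results. Since that is not the focus of this article, I have left out the explanation.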